Abstract In density estimation tasks, the maximum entropy model (Maxent) can effectively
use reliable prior information via certain constraints, i.e., linear constraints without
empirical parameters. However, reliable prior information is often insufficient, and
the selection of uncertain constraints becomes necessary but poses considerable
implementation complexity. Improper setting of uncertain constraints can result in
overfitting or underfitting. To solve this problem, a generalization of Maxent, under the
Tsallis entropy framework, is proposed. The proposed method introduces a convex quadratic
constraint for the correction of the (expected) Tsallis entropy bias (TEB). Specifically, we
demonstrate that the expected Tsallis entropy of sampling distributions is smaller than
the Tsallis entropy of the underlying real distribution. This expected entropy reduction
is exactly the (expected) TEB, which can be expressed by a closed-form formula and
acts as a consistent and unbiased correction. TEB indicates that the entropy of a
specific sampling distribution should be increased accordingly. This entails a quantitative
re-interpretation of the Maxent principle. By compensating TEB while forcing the
resulting distribution to be close to the sampling distribution, our generalized
TEBC Maxent can be expected to alleviate overfitting and underfitting. We also
present a connection between TEB and the Lidstone estimator. As a result, the TEB-Lidstone
estimator is developed by analytically identifying the rate of probability correction in
Lidstone. Extensive empirical evaluation shows promising performance of both TEBC
Maxent and TEB-Lidstone in comparison with various state-of-the-art density
estimation methods.
1 Introduction



The maximum entropy (Maxent) approach to density estimation was originally
proposed by E. T. Jaynes [Jaynes, 1957], and since then has been widely used in many
areas of computer science and statistical learning, especially natural language
processing [Berger et al., 1996; Pietra et al., 1997]. The Maxent principle can be traced back
to Jaynes' classical description [Jaynes, 1957]:



". . . the fact that a probability distribution maximizes entropy subject to certain 
constraints representing our incomplete information, is the fundamental property which 
justifies use of that distribution for inference; it agrees with everything that is known, 
but carefully avoids assuming anything that is not known. . ." 

In implementing this principle, given a sampling distribution drawn from the
underlying real distribution, Maxent computes a resulting distribution whose entropy
is maximized, subject to a set of selected constraints. The standard Maxent can be
formulated as Formula (1):



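$$
\begin{aligned}
\max_{P^{(m)}} \quad & S\left[P^{(m)}\right] \\
\text{s.t.} \quad & \Big| \sum_{i \in S_u} p_i - a_u \Big| \le \delta_u, \quad \forall u \in U \\
& \sum_{i \in S_c} p_i = a_c \ \text{ or } \ \sum_{i \in S_c} p_i \ge a_c, \quad \forall c \in C
\end{aligned}
\tag{1}
$$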

where $P^{(m)} = (p_1, \ldots, p_m)$ is the resulting m-nomial probability distribution, $S[\cdot]$
denotes the Shannon entropy of some probability distribution, $C$ and $U$ are two index
sets, $a_c$, $c \in C$, is the constant determined by reliable information, $a_u$ and $\delta_u$, $u \in U$,
are the parameters that need to be empirically adjusted, and $S_c$, $c \in C$, and $S_u$, $u \in U$,
are subsets of $\{1, 2, \ldots, m\}$. Standard Maxent has two sets of constraints. The first
set (indexed by $C$) includes all certain constraints, which are derived from reliable
prior information and do not involve empirical parameters. For example, two of the
most common certain constraints are $\sum_i p_i = 1$ and $p_i \ge 0$. The second set (indexed
by $U$) includes all uncertain constraints, which are from less reliable knowledge or
sample information, and hence necessarily involve empirical parameters (e.g., $a_u$ and
$\delta_u$) to gain a satisfying performance. Note that there could be other specific forms
of constraints not listed in Formula (1), e.g., the common form of real-valued feature
functions. However, these constraint forms are essentially equivalent to and can be
categorized into certain or uncertain constraints. Moreover, all certain and uncertain
constraints considered in this paper are linear.



1.1 The Problem 



Although the essential idea of Maxent is concise and elegant, the implementation of
Maxent poses considerable practical complexity. Specifically, in a typical density
estimation task, reliable prior information is often insufficient. In this case, if Maxent only
involves certain constraints derived from reliable prior information, the resulting
distribution will be far from the sampling distribution. Consequently, underfitting will
result. Hence, Maxent usually involves a set of uncertain constraints, which force the
resulting distribution to be close to the sampling distribution. The tolerable violation
level of the resulting distribution against the sampling distribution is controlled by a
set of threshold parameters. These constraints and parameters are essentially empirical
and ad hoc. This is a dilemma: on one hand, if a large number of uncertain constraints
and a set of tight threshold parameters are involved, the solution of Maxent will be
close to the sampling distribution and might severely overfit the sample [Dudik et al.,
2007]; on the other hand, if a small number of uncertain constraints or a set of loose
threshold parameters are used, Maxent might underfit the sample and miss out some
useful sample information.



1.2 Existing Work 

In the framework of Maxent, the main approaches to tackling overfitting or underfitting are
parameter regularization and constraint relaxation. The former introduces some
specific statistics (e.g., $\ell_1$, $\ell_2^2$, $\ell_1 + \ell_2^2$, etc.) as the regularization terms of the objective
function and removes explicit constraints [Dudik et al., 2007; Chen and Rosenfeld, 2000;
Lebanon and Lafferty, 2001; Lau, 1994]. The latter aims to relax the constraints
according to some theoretical considerations [Khudanpur, 1995; Kazama and Tsujii, 2003;
Jedynak and Khudanpur, 2005], e.g., the Maximum Likelihood set in [Jedynak and Khudanpur,
2005]. The performance guarantee of some Maxent variants is rigorously established
with respect to (w.r.t.) finite sample criteria, e.g., Probably Approximately Correct
(PAC). However, to the best of our knowledge, most of these guarantees are, to some
extent, self-referencing. For example, using log loss as the criterion, theoretical relations
between the solution of Generalized Maxent (GME) and the best Gibbs distribution
are given [Dudik et al., 2007]. However, the definition of the best Gibbs distribution
intrinsically depends on the selection of feature functions. It turns out that, if the
selection of feature functions is improper, the solution of GME might not be able to avoid
overfitting or underfitting substantially even if it is close to the best Gibbs distribution.



1.3 Our Approach 

In this paper, we propose a novel generalization of Maxent, under the framework of
Tsallis entropy¹ [Tsallis, 1988; Abe, 2000].

An important motivating observation is that the expected Tsallis entropy of
sampling distributions is always smaller than the Tsallis entropy of the underlying real
distribution. To demonstrate this formally, we present a theoretical analysis on the
expected Tsallis entropy bias (TEB)² between sampling distributions and the underlying
real distribution. The TEB is independent of the selection³ of constraints and can be
expressed by a simple closed-form formula of the sample size $n$ and the Tsallis entropy
of the underlying real distribution. This observation naturally entails a quantitative
re-interpretation and a theoretical guarantee of the Maxent principle: since the
entropy of sampling distributions is smaller, in the expected sense, than the entropy of
the underlying real distribution, Maxent should increase the entropy of the sampling
distribution to compensate the TEB and hence approximate the underlying real
distribution. The TEB is first developed in the frequentist framework and we denote it as
Frequentist-TEB. In addition, by assuming a uniform Bayesian prior over all possible
m-nomial distributions, a Bayesian-TEB is developed.

1 Please refer to Section 3.1 for more details about Tsallis entropy. For the sake of analytical
simplicity, we only consider the Tsallis entropy with q = 2 [Tsallis, 1988] in this paper.

2 In this paper, the notion of "Tsallis entropy bias" has the same meaning as the "expected
Tsallis entropy bias". Accordingly, TEB carries the "expected" sense in itself.

3 Actually, the TEB only depends on the i.i.d. sampling presumption.

We argue that, consistent with the basic principle of Maxent, a rigorously
established compensation of Frequentist-TEB or Bayesian-TEB can help alleviate the
overfitting problem. On the other hand, it is natural to overcome underfitting by
simply forcing the resulting distribution to be close to the sampling distribution. By
integrating these two strategies into our generalized Maxent, called Tsallis entropy
bias compensation (TEBC) Maxent, it is expected that TEBC Maxent can alleviate
both overfitting and underfitting. Note that it is somewhat problematic to develop a similar
method in the framework of Shannon entropy, since a consistent and unbiased correction
of Shannon entropy has not been found yet in general (see Section 5 for more
details on the estimate of Shannon entropy).

In implementation, the TEBC Maxent is convex and hence can be efficiently solved.
More importantly, TEBC Maxent can bypass the selection of uncertain constraints
as well as parameter identification by introducing a parameter-free TEB constraint,
aiming at quantitative entropy compensation.

In addition to the above Maxent framework, the generality of our theoretical results 
can be demonstrated by a practical connection between TEB and another widely used 
estimator, namely the Lidstone estimator. We will show that both Frequentist-TEB and 
Bayesian-TEB can offer guidance to identify the adaptive rate of probability correction, 
which needs to be empirically set in Lidstone. Accordingly, the so called "F-Lidstone" 
and "B-Lidstone" estimators are derived respectively. 

Extensive experimental results on a number of synthesized and real-world datasets 
demonstrate a promising performance of TEBC Maxent, F-Lidstone and B-Lidstone, 
in comparison with various state-of-the-art density estimation methods. 



2 Related Work 



The concept of maximum entropy has existed in the machine learning literature
for a long time and has resulted in various approaches. Its constrained form has
been widely applied in many contexts [Berger et al., 1996; Kazama and Tsujii, 2003;
Jedynak and Khudanpur, 2005]. Recently, there have been many studies of Maxent
with $\ell_1$-style regularization [Khudanpur, 1995; Kazama and Tsujii, 2003; Williams, 1995;
Ng, 2004; Goodman, 2004; Krishnapuram et al., 2005], $\ell_2^2$-style regularization [Lau, 1994;
Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001; Zhang, 2005], as well as
some other types of regularization such as $\ell_1 + \ell_2^2$-style [Kazama and Tsujii, 2003],
$\ell_2$-style regularization [Newman, 1977] and a smoothed version of $\ell_1$-style
regularization [Dekel et al., 2003]. [Altun and Smola, 2006] derive duality and performance
guarantees for settings in which the entropy is replaced by an arbitrary Bregman or
Csiszár divergence. A thorough theoretical analysis of regularized Maxent can be found
in [Dudik et al., 2007].
As another direction in density estimation, many smoothing methods have been
proposed in various contexts, e.g., information retrieval tasks [Zhai and Lafferty,
2001], speech recognition [Chen and Goodman, 1998] and cryptology [Good, 1953]. A
typical family of general-purpose smoothing methods is the Good-Turing estimator. All of
them use the following equation to calculate the resulting frequencies of events:

$$
p_X = \frac{(N_X + 1)}{T} \cdot \frac{E(N_X + 1)}{E(N_X)}
$$



where $X$ is an event, $N_X$ is the number of times the event $X$ has been seen within the
sample of size $T$, and $E(n)$ is an estimate of how many different events happened
exactly $n$ times. Different variants of the Good-Turing estimator, e.g., the simplest
Good-Turing estimator, the Simple Good-Turing estimator [Gale and Sampson, 1995] and the
diminishing-attenuation estimator [Orlitsky et al., 2003], are based on different
calculations of $E(\cdot)$.
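As an illustration of the equation above, the following minimal sketch takes $E(n)$ to be the raw count-of-counts, i.e., the number of distinct events observed exactly $n$ times. This choice of $E(\cdot)$, the function name `simplest_good_turing` and the toy sample are assumptions made for illustration only; the variants cited above replace the raw count-of-counts by smoothed estimates.

```python
from collections import Counter


def simplest_good_turing(sample):
    """Good-Turing frequencies with E(n) taken as the raw count-of-counts.

    Returns (estimates for seen events, total mass assigned to unseen events).
    """
    T = len(sample)                              # sample size
    counts = Counter(sample)                     # N_X for every seen event X
    count_of_counts = Counter(counts.values())   # E(n): events seen exactly n times
    estimates = {}
    for event, n_x in counts.items():
        # p_X = (N_X + 1) / T * E(N_X + 1) / E(N_X)
        estimates[event] = (n_x + 1) / T * count_of_counts.get(n_x + 1, 0) / count_of_counts[n_x]
    # Applying the same formula with N_X = 0 and summing over the E(0) unseen
    # events gives a total unseen mass of E(1) / T.
    unseen_mass = count_of_counts.get(1, 0) / T
    return estimates, unseen_mass


sample = list("aaabbcde")   # a:3, b:2, c:1, d:1, e:1
est, unseen = simplest_good_turing(sample)
```

With raw count-of-counts, the most frequent event `a` receives probability zero because no event was seen four times. This is one reason why practical variants smooth $E(\cdot)$.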



Another widely used smoothing method is the Lidstone estimator [Chen and Goodman,
1998]. Typical variants of the Lidstone estimator include the Expected Likelihood Estimator,
the Laplace estimator and the Add-tiny estimator. From a theoretical point of view, by defining
the attenuation of a probability estimator as the largest possible ratio between the
per-symbol probability assigned to an arbitrary sized sequence by any distribution and
the corresponding probability assigned by the estimator, it can be shown that the
attenuation of diminishing-attenuation estimators is unity [Orlitsky et al., 2003]. Note
that the attenuation analysis is an asymptotic analysis w.r.t. large sample performance.

For the entropy correction, there are some methods to estimate the Shannon
entropy bias. However, to the best of our knowledge, no consistent and unbiased
correction of Shannon entropy has been developed. In principle, the "inconsistency" theorem
leads to several approximations of the Shannon entropy bias [Miller, 1955; Carlton,
1969; Panzeri and Treves, 1996; Victor, 2000]. The analytical approximation of the bias
given by [Paninski, 2003] can be considered more rigorous than its predecessors. However,
this bias does not have a closed form and depends on the specific prior distribution and
the quantity $c$ [Paninski, 2003], and hence it is hard to compute in general. Regarding this,
Paninski proposed an estimator which is consistent even when $c$ is bounded
(provided that both $m$ and $n$ are sufficiently large) [Paninski, 2004]. However, a general
and exact closed-form formula for the Shannon entropy bias is still an open problem.

The rest of this paper is organized as follows: Section 3 gives a theoretical analysis
of two crucial observations; Section 4 discusses the estimate of TEBs (Frequentist-TEB
and Bayesian-TEB), introduces TEBC Maxent and reveals the connection between
TEBs and the Lidstone estimator; Section 5 gives two model-evaluating criteria;
experiments on synthesized and real-world datasets are constructed in Section 6 and the
experimental results are reported and discussed. Finally, conclusions and future work
are presented in Section 7.



3 Tsallis Entropy Bias 



In this section, we present two theoretical observations, which motivate and underpin 
the proposed TEB, TEBC Maxent and TEB-Lidstone estimator. 






3.1 Notations and Definitions 

We use the following notations throughout the rest of the paper: 

$P^{(m)} = (p_1, \ldots, p_m)$: The underlying real m-nomial ($m \ge 2$) probability
distribution, where $\sum_{i=1}^{m} p_i = 1$;

$\hat{P}_n^{(m)} = (\hat{p}_1, \ldots, \hat{p}_m)$: The sampling distribution of sample size $n$ w.r.t. $P^{(m)}$, where
$\sum_{i=1}^{m} \hat{p}_i = 1$;

$\mathcal{P}^{(m)}$: The set of all possible m-nomial ($m \ge 2$) probability distributions.

$T_q\left[P^{(m)}\right] = k \, \frac{1 - \sum_{i=1}^{m} p_i^q}{q - 1}$: The Tsallis entropy of $P^{(m)}$ w.r.t. the index $q$, where $k$
is the Boltzmann constant; we have $\lim_{q \to 1} T_q\left[P^{(m)}\right] = k \cdot H\left[P^{(m)}\right]$, where $H\left[P^{(m)}\right]$
is the Shannon information entropy of $P^{(m)}$.

The Tsallis entropy is the simplest entropy form that extends the Shannon
entropy while maintaining the basic properties but allowing, if $q \ne 1$,
nonextensivity [Santos and Math, 1997; Abe, 2000]. In this paper, for the sake of analytical
convenience, we always assume $q = 2$ and neglect the subscript $q$. In addition, without
loss of generality, we omit the Boltzmann constant. It then turns out that
$T\left[P^{(m)}\right] = 1 - \sum_{i=1}^{m} p_i^2$.
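The definition above can be made concrete with a short sketch. It computes the Tsallis entropy with the Boltzmann constant omitted ($k = 1$), and checks numerically that the value for $q$ close to 1 approaches the Shannon entropy (natural logarithm). The function name `tsallis_entropy` and the example distribution are illustrative assumptions.

```python
import math


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of distribution p with the Boltzmann constant set to 1."""
    if abs(q - 1.0) < 1e-12:
        # The q -> 1 limit is the Shannon entropy (in nats).
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)


p = [0.5, 0.25, 0.25]
t2 = tsallis_entropy(p)                       # q = 2: 1 - sum of squares
near_shannon = tsallis_entropy(p, q=1.0 + 1e-6)
shannon = tsallis_entropy(p, q=1.0)
```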



3.2 Main Results 



Proposition 1 Given an arbitrary m-nomial probability distribution $P^{(m)} \in \mathcal{P}^{(m)}$, let
$\hat{P}_n^{(m)}$ be the sampling distribution of sample size $n$ with respect to $P^{(m)}$, $E_{P^{(m)}}^{(n)}(T)$ be
the expected Tsallis entropy of $\hat{P}_n^{(m)}$, and $T\left[P^{(m)}\right]$ be the Tsallis entropy of $P^{(m)}$;
then

$$
E_{P^{(m)}}^{(n)}(T) = \frac{n-1}{n} \, T\left[P^{(m)}\right] \tag{2}
$$



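Proof Let $P^{(m)} = \left(p_1, \ldots, p_{m-1}, 1 - \sum_{i=1}^{m-1} p_i\right)$ be an m-nomial probability
distribution. Denote the set of all possible sampling distributions of sample size $n$ as

$$
S_n^{(m)} = \left\{ \hat{P}_n^{(m)} = \left( \frac{x_1}{n}, \ldots, \frac{x_{m-1}}{n}, \frac{n - \sum_{i=1}^{m-1} x_i}{n} \right) \;\middle|\; x_1, \ldots, x_{m-1} \in \mathbb{N},\ x_1 + \cdots + x_{m-1} \le n \right\}
$$

where $x_i$ is the count of the $i$-th nomial. Given $P^{(m)}$, the occurrence probability of
a sampling distribution $\hat{P}_n^{(m)} = \left( \frac{x_1}{n}, \ldots, \frac{x_{m-1}}{n}, \frac{n - \sum_{i=1}^{m-1} x_i}{n} \right) \in S_n^{(m)}$ is given by the
following equation:

$$
\Pr\left[\hat{P}_n^{(m)} \,\middle|\, P^{(m)}\right] = \frac{n!}{x_1! \cdots x_{m-1}! \left(n - \sum_{i=1}^{m-1} x_i\right)!} \; p_1^{x_1} \cdots p_{m-1}^{x_{m-1}} \left(1 - \sum_{i=1}^{m-1} p_i\right)^{n - \sum_{i=1}^{m-1} x_i} \tag{3}
$$

Note that we assume $0^0 = 1$ in Formula (3). Hence, by the definition of Tsallis entropy ($q = 2$), we have

$$
E_{P^{(m)}}^{(n)}(T) = \sum_{\hat{P}_n^{(m)} \in S_n^{(m)}} \Pr\left[\hat{P}_n^{(m)} \,\middle|\, P^{(m)}\right] \left( 1 - \sum_{i=1}^{m} \left(\frac{x_i}{n}\right)^2 \right)
$$

where we denote $n - \sum_{i=1}^{m-1} x_i$ as $x_m$. It is convenient to express $E_{P^{(m)}}^{(n)}(T)$ as the
following:

$$
E_{P^{(m)}}^{(n)}(T) = 1 - \sum_{i=1}^{m} E_{P^{(m)}}^{(n)}\left( \left(\frac{x_i}{n}\right)^2 \right)
$$

where

$$
E_{P^{(m)}}^{(n)}\left( \left(\frac{x_i}{n}\right)^2 \right) = \sum_{x_1=0}^{n} \sum_{x_2=0}^{n-x_1} \cdots \sum_{x_{m-1}=0}^{n - \sum_{j=1}^{m-2} x_j} \left(\frac{x_i}{n}\right)^2 \Pr\left[\hat{P}_n^{(m)} \,\middle|\, P^{(m)}\right], \quad i = 1, \ldots, m
$$

Note that $E_{P^{(m)}}^{(n)}\left(x_i^2\right)$ is just the second moment about the origin of the multinomial distribution
and is given by $E_{P^{(m)}}^{(n)}\left(x_i^2\right) = n(n-1)p_i^2 + n p_i$. Hence, we have $E_{P^{(m)}}^{(n)}\left( \left(\frac{x_i}{n}\right)^2 \right) = \frac{(n-1)p_i^2 + p_i}{n}$.
It turns out that

$$
E_{P^{(m)}}^{(n)}(T) = 1 - \sum_{i=1}^{m} \frac{(n-1)p_i^2 + p_i}{n} = \frac{n-1}{n} \left( 1 - \sum_{i=1}^{m} p_i^2 \right)
$$

The r.h.s. of the last equation is just $\frac{n-1}{n} T\left[P^{(m)}\right]$.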
Corollary 1

$$
\lim_{n \to \infty} E_{P^{(m)}}^{(n)}(T) = T\left[P^{(m)}\right] \tag{4}
$$

Proof The corollary follows immediately from Formula (2).
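Proposition 1 can be checked numerically for small $m$ and $n$ by enumerating every possible count vector, weighting the Tsallis entropy ($q = 2$) of the corresponding sampling distribution by its multinomial probability, and comparing the sum with $\frac{n-1}{n} T[P^{(m)}]$. The sketch below is only such a sanity check, not the validation package referred to in this paper; all function names and the example distribution are illustrative assumptions.

```python
from itertools import product
from math import factorial


def tsallis2(p):
    """Tsallis entropy with q = 2 and the Boltzmann constant omitted."""
    return 1.0 - sum(pi * pi for pi in p)


def multinomial_pmf(counts, p):
    """Probability of observing the count vector `counts` under distribution p."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)          # exact integer division at every step
    prob = float(coef)
    for x, pi in zip(counts, p):
        prob *= pi ** x                # Python evaluates 0.0 ** 0 as 1.0
    return prob


def expected_sample_tsallis(p, n):
    """Exact expectation of T over all sampling distributions of size n."""
    m = len(p)
    total = 0.0
    for head in product(range(n + 1), repeat=m - 1):
        if sum(head) > n:
            continue
        counts = head + (n - sum(head),)
        total += multinomial_pmf(counts, p) * tsallis2([x / n for x in counts])
    return total


p = [0.6, 0.3, 0.1]
n = 7
lhs = expected_sample_tsallis(p, n)
rhs = (n - 1) / n * tsallis2(p)
```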

The above result is developed in the frequentist framework and hence corresponds
to a Frequentist-TEB. In the following, a uniform Bayesian TEB (Bayesian-TEB for
short) is developed by assuming a uniform Bayesian prior over all possible m-nomial
distributions⁴.

Proposition 2 Given the uniform probability metric over $\mathcal{P}^{(m)}$, the expectation of
$E_{P^{(m)}}^{(n)}(T)$, i.e., $E_n^{(m)}(T)$, is given by

$$
E_n^{(m)}(T) = \frac{(n-1) \cdot (m-1)}{n \cdot (m+1)} \tag{5}
$$

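Proof By the definition of mathematical expectation, we have

$$
E_n^{(m)}(T) = \frac{1}{Z^{(m-1)}} \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} E_{P^{(m)}}^{(n)}(T) \; dp_{m-1} \, dp_{m-2} \cdots dp_1
$$

where $Z^{(m-1)}$ is the normalization factor determined by the $(m-1)$-order integral
operator,

$$
Z^{(m-1)} = \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} dp_{m-1} \, dp_{m-2} \cdots dp_1 = \frac{1}{(m-1)!}
$$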



4 The code package for the numeric evaluation of Propositions 1 and 2 is provided at
http://www.comp.rgu.ac.uk/staff/pz/TEBC/Proposition_Validation_Codes.zip
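Expanding and rewriting $E_{P^{(m)}}^{(n)}(T)$ gives the following:

$$
E_{P^{(m)}}^{(n)}(T) = \frac{2(n-1)}{n} \left( \sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-2} \sum_{j=i+1}^{m-1} p_i p_j \right) \tag{6}
$$

Hence, we have

$$
E_n^{(m)}(T) = \frac{2(n-1)}{Z^{(m-1)} \cdot n} \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} \left( \sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-2} \sum_{j=i+1}^{m-1} p_i p_j \right) dp_{m-1} \, dp_{m-2} \cdots dp_1
$$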
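To simplify the notations, we define the $(m-1)$-order integral operator:

$$
L^{(m-1)} = \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} dp_{m-1} \, dp_{m-2} \cdots dp_1 \tag{7}
$$

We denote $L^{(m-1)}\left(f(p_1, \ldots, p_{m-1})\right)$ as

$$
L^{(m-1)}\left(f(p_1, \ldots, p_{m-1})\right) = \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} f(p_1, \ldots, p_{m-1}) \; dp_{m-1} \, dp_{m-2} \cdots dp_1
$$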

Then the proof of Formula (5) is reduced to solving the closed form of

$$
L^{(m-1)}\left( \sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-2} \sum_{j=i+1}^{m-1} p_i p_j \right)
$$

Due to the symmetry of the integral domain, $L^{(m-1)}$ has the following properties:

(a) $L^{(m-1)}(p_i) = L^{(m-1)}(p_j)$, $1 \le i, j \le m-1$

(b) $L^{(m-1)}(p_i^2) = L^{(m-1)}(p_j^2)$, $1 \le i, j \le m-1$

(c) $L^{(m-1)}(p_i \cdot p_j) = L^{(m-1)}(p_k \cdot p_l)$, $1 \le i, j, k, l \le m-1$, $i \ne j$, $k \ne l$

Therefore, if the general term formulae of $L^{(m-1)}(p_i)$, $L^{(m-1)}(p_i^2)$ and $L^{(m-1)}(p_i \cdot p_j)$, $i \ne j$,
and their term numbers are available, the general term formula of $E_n^{(m)}(T)$ could be
obtained directly.

The following general term formulae can be verified:

(a') $L^{(m-1)}(p_i) = \frac{1}{m!}$

(b') $L^{(m-1)}(p_i^2) = \frac{2}{(m+1)!}$

(c') $L^{(m-1)}(p_i \cdot p_j) = \frac{1}{(m+1)!}$, $i \ne j$

then

$$
E_n^{(m)}(T) = \frac{2(n-1)}{Z^{(m-1)} \cdot n} \left[ \frac{m-1}{m!} - \frac{2(m-1)}{(m+1)!} - \frac{(m-1) \cdot (m-2)}{2(m+1)!} \right]
$$

Recall that $Z^{(m-1)} = \frac{1}{(m-1)!}$; after some simplification steps, Formula (5) is
obtained from the above equation, which completes the proof of Proposition 2.
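The simplification steps can be checked with exact rational arithmetic. The sketch below assumes that the general term formulae take the values $1/m!$, $2/(m+1)!$ and $1/(m+1)!$ (the moments of the uniform distribution over the simplex, rescaled by $Z^{(m-1)} = 1/(m-1)!$), assembles the bracketed expression with the term counts $m-1$, $m-1$ and $(m-1)(m-2)/2$, and compares the result with the closed form of Formula (5). The function name is an illustrative assumption.

```python
from fractions import Fraction
from math import factorial


def bayes_expected_tsallis(n, m):
    """Assemble E_n^(m)(T) from the assumed general term formulae."""
    Z = Fraction(1, factorial(m - 1))          # normalization factor
    L_p = Fraction(1, factorial(m))            # L(p_i)
    L_p2 = Fraction(2, factorial(m + 1))       # L(p_i^2)
    L_pp = Fraction(1, factorial(m + 1))       # L(p_i * p_j), i != j
    pairs = (m - 1) * (m - 2) // 2             # number of i < j pairs among m - 1 indices
    bracket = (m - 1) * L_p - (m - 1) * L_p2 - pairs * L_pp
    return Fraction(2 * (n - 1), n) / Z * bracket


def closed_form(n, m):
    """Right-hand side of Formula (5)."""
    return Fraction((n - 1) * (m - 1), n * (m + 1))


example = bayes_expected_tsallis(10, 3)        # Fraction(9, 20)
```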
Corollary 2 Let $E^{(m)}(T) = \frac{1}{Z^{(m-1)}} \int_0^1 \int_0^{1-p_1} \cdots \int_0^{1 - \sum_{i=1}^{m-2} p_i} T\left[P^{(m)}\right] dp_{m-1} \, dp_{m-2} \cdots dp_1$,
where $Z^{(m-1)}$ is the normalization factor $\frac{1}{(m-1)!}$. Then we have

$$
\lim_{n \to \infty} E_n^{(m)}(T) = E^{(m)}(T) \tag{8}
$$

Proof The corollary follows directly from the fact that $E^{(m)}(T)$ can be given by the
r.h.s. of Formula (5) except for a multiplicative factor $\frac{n-1}{n}$.



4 Density Estimate based on Tsallis Entropy Bias 

Based on the above theoretic results, we first discuss the issue on the estimate of Tsallis 
entropy bias, and then propose density estimate methods in Maxent framework and 
Lidstone framework, respectively. 



4.1 On Estimation of Tsallis Entropy Bias 

To apply the result of Proposition 1, the frequentist Tsallis entropy bias
(Frequentist-TEB) should be effectively estimated so that the Tsallis entropy of the sampling
distribution $\hat{P}_n^{(m)}$ can be compensated accordingly. According to Formula (2), the expected
Tsallis entropy of $\hat{P}_n^{(m)}$, i.e., $E_{P^{(m)}}^{(n)}(T)$, is $(n-1)/n$ of the Tsallis entropy of the
underlying real distribution $P^{(m)}$, i.e., $T\left[P^{(m)}\right]$. We denote the estimation of $E_{P^{(m)}}^{(n)}(T)$ as
$\hat{E}_{P^{(m)}}^{(n)}(T)$; then $\frac{n}{n-1} \hat{E}_{P^{(m)}}^{(n)}(T)$ can be considered as an estimation of $T\left[P^{(m)}\right]$. Hence,
the estimation of the frequentist Tsallis entropy bias, i.e., Frequentist-TEB, is given by

$$
\Delta T = \frac{n}{n-1} \hat{E}_{P^{(m)}}^{(n)}(T) - T\left[\hat{P}_n^{(m)}\right] \tag{9}
$$

The simplest (and unbiased) estimation of $E_{P^{(m)}}^{(n)}(T)$ is given by $T\left[\hat{P}_n^{(m)}\right]$, and hence
the corresponding estimated TEB is given by

$$
\Delta T = \frac{1}{n-1} T\left[\hat{P}_n^{(m)}\right] \tag{10}
$$

Remark 1 $T\left[\hat{P}_n^{(m)}\right]$ is an unbiased estimate of $E_{P^{(m)}}^{(n)}(T)$. Therefore, Formula (10) gives
an unbiased correction of $T\left[\hat{P}_n^{(m)}\right]$. In addition, the consistency of this correction
follows from the law of large numbers (if $m$ is finite) or the central limit theorem for
multinomial sums [Morris, 1975] (if $m$ is infinite).

Surprisingly, experimental results (detailed in Section 6) show that the TEBC Maxent
and TEB-Lidstone estimator based on this naive estimator can outperform all
comparative density estimation models in most cases.

In many cases, using statistical re-sampling techniques, e.g., Bootstrap [Wasserman,
2006], we can achieve more accurate estimations of $T\left[P^{(m)}\right]$, and hence obtain better
estimations of Frequentist-TEB. In the following, we give an estimation procedure for
$T\left[P^{(m)}\right]$, which is optimal in the sense of the least squared error.
Let us rewrite Formula (2) so that $E_{P^{(m)}}^{(n)}(T)$ is expressed as a function of $n$:

$$
E_{P^{(m)}}^{(n)}(T) = K \, \frac{n-1}{n}
$$

where $K$ is a constant slope determined by $T\left[P^{(m)}\right]$. We can estimate $E_{P^{(m)}}^{(i)}(T)$, $1 \le i \le n$,
by re-sampling techniques and obtain $\hat{E}_{P^{(m)}}^{(i)}(T)$, $1 \le i \le n$. Note that the
re-sampling is meaningful only if $i \le n$. The remaining task is to solve an unconstrained
quadratic program so that the squared error

$$
\sum_{i=1}^{n} \left( \hat{E}_{P^{(m)}}^{(i)}(T) - K \, \frac{i-1}{i} \right)^2 \tag{11}
$$

is minimized. This minimum corresponds to the zero point of the first derivative of the
cost function:

$$
\frac{d}{dK} \sum_{i=1}^{n} \left( \hat{E}_{P^{(m)}}^{(i)}(T) - K \, \frac{i-1}{i} \right)^2 = 0
$$

By expanding the above equation, it turns out that the estimated slope is given by

$$
\hat{K} = \frac{\sum_{i=1}^{n} \hat{E}_{P^{(m)}}^{(i)}(T) \cdot \frac{i-1}{i}}{\sum_{i=1}^{n} \frac{(i-1)^2}{i^2}}
$$

where $\hat{K}$ is the estimation of $T\left[P^{(m)}\right]$. Note that, in practice, it is often sufficient to
only involve an appropriate subset of $\hat{E}_{P^{(m)}}^{(i)}(T)$, $1 \le i \le n$, in the cost function (11)
to construct the estimate of $T\left[P^{(m)}\right]$.
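The least-squares slope can be sketched as follows. The input is assumed to be a mapping from sub-sample sizes $i$ to re-sampling estimates of the expected Tsallis entropy at that size; how those estimates are produced (e.g., by Bootstrap) is outside this sketch, and the function name is hypothetical. At least one size $i \ge 2$ is needed, since the term for $i = 1$ contributes nothing to either sum.

```python
def estimate_true_tsallis(expected_by_size):
    """Least-squares slope K-hat of E^(i)(T) = K * (i - 1) / i.

    expected_by_size: dict mapping a sub-sample size i to an estimate of E^(i)(T).
    """
    numerator = sum(e * (i - 1) / i for i, e in expected_by_size.items())
    denominator = sum(((i - 1) / i) ** 2 for i in expected_by_size)
    return numerator / denominator


# Noise-free illustration: estimates generated from a known slope are recovered.
true_k = 0.62
estimates = {i: true_k * (i - 1) / i for i in range(2, 11)}
k_hat = estimate_true_tsallis(estimates)
```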

However, even with the re-sampling technique, in the case that the sampling is
seriously inadequate, it still seems difficult to estimate $E_{P^{(m)}}^{(n)}(T)$ accurately, which
might in turn result in an inaccurate Frequentist-TEB. In this case, some Bayesian prior
over the space of all possible real distributions might be more useful. The results of
Proposition 2 and Corollary 2 guide the construction of the uniform Bayesian-TEB, i.e.,

$$
\Delta T = \frac{m-1}{n \cdot (m+1)} \tag{12}
$$

by assuming the uniform probability metric over all possible real distributions. Note
that the uniform Bayesian-TEB is directly obtained by computing the difference between
the expected Tsallis entropy of all possible real distributions and the expectation
of $E_{P^{(m)}}^{(n)}(T)$, i.e., $E_n^{(m)}(T)$, w.r.t. the uniform prior. Hence, it avoids the estimation of
$E_{P^{(m)}}^{(n)}(T)$.

Remark 2 Corollary 2 shows that $E_n^{(m)}(T)$ is an asymptotically unbiased estimate
of $E^{(m)}(T)$ and can be made unbiased if the Bayesian-TEB $\frac{m-1}{n \cdot (m+1)}$ is compensated. In
addition, this corrected estimator is also consistent with $E^{(m)}(T)$.

In summary, Remark 1 and Remark 2 show that the estimates of the Tsallis entropy
bias, w.r.t. both the frequentist and Bayesian frameworks, can be considered sound in the
sense of unbiasedness and consistency. Note that the Shannon entropy estimate lacks
these guarantees.
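Both bias estimates are simple to compute. A minimal sketch, assuming the naive Frequentist-TEB of Formula (10) and the uniform Bayesian-TEB of Formula (12); the function names and the example counts are illustrative assumptions.

```python
def tsallis2(p):
    """Tsallis entropy with q = 2 and the Boltzmann constant omitted."""
    return 1.0 - sum(pi * pi for pi in p)


def frequentist_teb(counts):
    """Naive Frequentist-TEB: T[sampling distribution] / (n - 1); needs n >= 2."""
    n = sum(counts)
    p_hat = [x / n for x in counts]
    return tsallis2(p_hat) / (n - 1)


def bayesian_teb(n, m):
    """Uniform Bayesian-TEB: (m - 1) / (n * (m + 1)); depends only on n and m."""
    return (m - 1) / (n * (m + 1))


counts = [5, 3, 2]                       # n = 10 observations over m = 3 nomials
f_teb = frequentist_teb(counts)          # 0.62 / 9
b_teb = bayesian_teb(sum(counts), len(counts))   # 2 / 40
```

The Bayesian-TEB does not depend on the observed counts at all, which is why it remains usable when the sample is too small for a reliable estimate of the expected Tsallis entropy.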
4.2 TEBC Maxents

In this subsection, we propose three Tsallis entropy bias compensation (TEBC)
Maxents to compute resulting distributions. These distributions are most similar to the
sampling distribution w.r.t. different similarity criteria, subject to the constraint that
the estimated Frequentist/Bayesian-TEB, denoted as $\Delta T$, is forcibly compensated.
This strategy can help to alleviate the overfitting and underfitting problems.

Given any m-nomial sampling distribution $\hat{P}_n^{(m)} = (\hat{p}_1, \ldots, \hat{p}_m)$ of sample size $n$
and the estimated $\Delta T$, the TEBC Maxents can be constructed to compute the resulting
distribution $\bar{P}^{(m)} = (\bar{p}_1, \ldots, \bar{p}_m)$ w.r.t. the criterion of the $\ell_2^2$ norm, Jensen-Shannon (JS)
divergence (see Formula (15) for details) or Maximum Likelihood, respectively:

Model 1: $\ell_2^2$ Tsallis Entropy Bias Compensation ($\ell_2^2$-TEBC)

$$
\begin{aligned}
\min_{\bar{P}^{(m)}} \quad & \sum_{i=1}^{m} \left( \bar{p}_i - \hat{p}_i \right)^2 \\
\text{s.t.} \quad & T\left[\bar{P}^{(m)}\right] \ge T\left[\hat{P}_n^{(m)}\right] + \Delta T \\
& \text{Certain Constraints}
\end{aligned}
\tag{13}
$$

Model 2: JS-Divergence Tsallis Entropy Bias Compensation (JSD-TEBC)

$$
\begin{aligned}
\min_{\bar{P}^{(m)}} \quad & JSD\left[\bar{P}^{(m)} \,\middle|\, \hat{P}_n^{(m)}\right] \\
\text{s.t.} \quad & T\left[\bar{P}^{(m)}\right] \ge T\left[\hat{P}_n^{(m)}\right] + \Delta T \\
& \text{Certain Constraints}
\end{aligned}
\tag{14}
$$

where $JSD[\cdot|\cdot]$ denotes the JS-divergence.

Note that a common statistic to measure the divergence between two probability
distributions is the Kullback-Leibler (KL) divergence. Despite the computational and
theoretical advantages of the KL-divergence, it is not symmetric in its arguments. Reversing
the arguments in the KL-divergence function can yield substantially different results.
Furthermore, $KL(P, Q)$ may be seriously underestimated if $P$ involves zero terms, since
$\lim_{p_i \to 0} p_i \log \frac{p_i}{q_i} = 0$. Besides, $KL(P, Q)$ is sensitive to the penalty terms used in the case
of $q_i = 0$. Hence, we apply a symmetrized variant of the KL-divergence, i.e., the JS-divergence,
instead:

$$
JSD[P|Q] = \frac{1}{2} D[P|M] + \frac{1}{2} D[Q|M] \tag{15}
$$

where $D[P|M]$ is the KL-divergence from $P$ to $M$ and $M = (P + Q)/2$.

Model 3: Maximum Likelihood Tsallis Entropy Bias Compensation (ML-TEBC)

$$
\begin{aligned}
\max_{\bar{P}^{(m)}} \quad & \log \left\{ \Pr\left[\hat{P}_n^{(m)} \,\middle|\, \bar{P}^{(m)}\right] \right\} \\
\text{s.t.} \quad & T\left[\bar{P}^{(m)}\right] \ge T\left[\hat{P}_n^{(m)}\right] + \Delta T \\
& \text{Certain Constraints}
\end{aligned}
\tag{16}
$$

where $\Pr\left[\hat{P}_n^{(m)} \,\middle|\, \bar{P}^{(m)}\right]$ is given by Formula (3).
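For intuition, Model 1 admits a closed-form solution in the special case where the only certain constraints are normalization and non-negativity. This special case is an observation added here for illustration and is not a method stated in this paper: on the simplex, $\sum_i p_i^2 = 1/m + \lVert P - U \rVert^2$ with $U$ the uniform distribution, so the TEB constraint is a ball around $U$ and the nearest feasible point shrinks the sampling distribution towards $U$. The sketch also evaluates the JS-divergence of Formula (15), using natural logarithms as an assumption. When the target entropy exceeds the maximum $1 - 1/m$, the sketch clamps it and returns the uniform distribution, which is again an assumption of this illustration.

```python
import math


def tsallis2(p):
    """Tsallis entropy with q = 2 and the Boltzmann constant omitted."""
    return 1.0 - sum(pi * pi for pi in p)


def kl_divergence(p, q):
    """KL-divergence D[p|q]; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS-divergence: average KL-divergence to the midpoint M = (p + q) / 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def l2_tebc_simplex(p_hat, delta_t):
    """Model 1 with only sum(p) = 1 and p >= 0 as certain constraints."""
    m = len(p_hat)
    max_entropy = 1.0 - 1.0 / m
    # Clamp infeasible targets to the maximum entropy (assumption of this sketch).
    target = min(tsallis2(p_hat) + delta_t, max_entropy)
    dist2 = sum((pi - 1.0 / m) ** 2 for pi in p_hat)   # squared distance to uniform
    if dist2 == 0.0:
        return list(p_hat)                             # already uniform
    radius2 = max_entropy - target                     # allowed squared distance
    lam = min(1.0, math.sqrt(radius2 / dist2))
    # Convex combination of p_hat and uniform, hence non-negative and normalized.
    return [lam * pi + (1.0 - lam) / m for pi in p_hat]


p_hat = [0.7, 0.2, 0.1]                 # sampling distribution for n = 10
delta_t = tsallis2(p_hat) / 9           # naive Frequentist-TEB, T / (n - 1)
solution = l2_tebc_simplex(p_hat, delta_t)
divergence = js_divergence(solution, p_hat)
```

With additional certain constraints this shortcut no longer applies and a general convex solver is needed.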

In Models 1, 2 and 3, all objective functions aim at forcing the resulting distribution
to be as similar to the sampling distribution as possible. This is to counter the underfitting
problem. Meanwhile, the TEB constraint is used to increase the entropy of the
resulting distribution and thereby compensate the Tsallis entropy bias, in order to make the
resulting distribution approximate the underlying real distribution. This is to avoid the
overfitting problem. From another point of view, in the expected sense, the objective
function aims at reducing the Tsallis entropy, while the TEB constraint is adopted to
necessarily increase the Tsallis entropy. Through this joint effort, the objective function
forces the TEB constraint to hold with equality, which is consistent with our previous
theoretical analysis.

In implementation, all the above objective functions and constraints are convex.
Therefore, TEBC Maxents can be globally solved by efficient methods, e.g., the
interior-point method [Boyd and Vandenberghe, 2004]. In specific application contexts, TEBC
Maxents should include certain constraints, which are derived from reliable prior
information and do not involve empirical threshold parameters. A specific example of the
form of certain constraints is given in our experiment (see Section 6.2.2 for details).

TEBC Maxents can be constructed with Frequentist-TEB or Bayesian-TEB. In
cases where the sampling process is seriously inadequate, the Bayesian TEBC Maxents are
expected to have stable performance. The reason is that, given a uniform Bayesian prior
and an inadequate sampling, e.g., $n \approx m$, the standard deviation of $E_{P^{(m)}}^{(n)}(T)$ tends
to be negligible compared to the uniform Bayesian-TEB, which implies a relatively stable
estimation of the uniform Bayesian-TEB. The detailed proof is given in Proposition 3 of
Appendix A.



4.3 TEB-Lidstone Estimators 

There is a natural connection between TEBs and the Lidstone estimator. Lidstone's law
of succession suggests the family of Lidstone estimators in the following form:

$$
\bar{p}_i = \frac{x_i + f}{n + f \cdot m}
$$

where $n$ is the sample size, $m$ is the number of nomials, $x_i$ is the count of the $i$-th nomial
and $f$ is a parameter indicating the rate of probability correction (normally between
0 and 1). When $f = 0.5$, it turns out to be the well-known Expected Likelihood Estimator
(ELE), i.e.,

$$
\bar{p}_i = \frac{x_i + 0.5}{n + 0.5 \cdot m}
$$

Another two common Lidstone estimators are the add-one estimator ($f = 1$) and the
add-tiny estimator ($f = 1/n$). The smaller $f$ is, the less probability mass it compensates
for underestimations. There exist some explanations for the selection of the parameter $f$.
For example, ELE has a Bayesian justification by assuming a uniform prior for a
binomially distributed variable [Box and Tiao, 1973]. However, in general cases, $f$ is
empirically configured.

TEBs offer a set of criteria, either of which analytically identifies an adaptive $f$
w.r.t. a specific input sample, and derives the TEB-Lidstone estimator. The fundamental
idea is to solve for an $f$ such that the Tsallis entropy bias of the input sample is
quantitatively compensated. That is,

$$
1 - \sum_{i=1}^{m} \left( \frac{x_i + f}{n + f \cdot m} \right)^2 = T\left[\hat{P}_n^{(m)}\right] + \Delta T
$$

where $\Delta T$ can be Frequentist-TEB or Bayesian-TEB. Let $a = 1 - \left( T\left[\hat{P}_n^{(m)}\right] + \Delta T \right)$.
It turns out that we have the following quadratic equation in the single variable $f$:

$$
\left( a m^2 - m \right) f^2 + 2n \left( a m - 1 \right) f + \left[ a n^2 - \sum_{i=1}^{m} x_i^2 \right] = 0 \tag{17}
$$

Occasionally, Formula (17) has no real roots. In this case, we can simply select the $f$
corresponding to the minimum (if $a m^2 - m > 0$) or the maximum (if $a m^2 - m < 0$)
of the l.h.s. of Formula (17). Consequently, the so-called "F-Lidstone" and "B-Lidstone"
estimators are derived from Frequentist-TEB and Bayesian-TEB, respectively.
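The derivation of the adaptive $f$ can be sketched in a few lines: build the coefficients of the quadratic in Formula (17), take a non-negative real root, and plug it into the Lidstone form. The sketch covers only the case where a non-negative real root exists and returns `None` otherwise; the fallback for the no-real-root case described above is deliberately left out. Function names and the example counts are illustrative assumptions.

```python
import math


def teb_lidstone_f(counts, delta_t):
    """Solve the quadratic in f; return a non-negative real root or None."""
    n = sum(counts)
    m = len(counts)
    t_hat = 1.0 - sum((x / n) ** 2 for x in counts)   # Tsallis entropy of the sample
    a = 1.0 - (t_hat + delta_t)
    qa = a * m * m - m                                # coefficient of f^2
    qb = 2.0 * n * (a * m - 1.0)                      # coefficient of f
    qc = a * n * n - sum(x * x for x in counts)       # constant term
    disc = qb * qb - 4.0 * qa * qc
    if qa == 0.0 or disc < 0.0:
        return None                                   # no usable real root in this sketch
    roots = [(-qb + sign * math.sqrt(disc)) / (2.0 * qa) for sign in (1.0, -1.0)]
    non_negative = [r for r in roots if r >= 0.0]
    return min(non_negative) if non_negative else None


def lidstone(counts, f):
    """Lidstone estimate (x_i + f) / (n + f * m)."""
    n = sum(counts)
    m = len(counts)
    return [(x + f) / (n + f * m) for x in counts]


counts = [7, 2, 1]
n = sum(counts)
t_hat = 1.0 - sum((x / n) ** 2 for x in counts)
delta_t = t_hat / (n - 1)                 # naive Frequentist-TEB
f = teb_lidstone_f(counts, delta_t)       # close to 0.509 for this sample
estimate = lidstone(counts, f)
```

For this sample the adaptive rate lands near the ELE value of 0.5, but it changes with the observed counts and with the chosen $\Delta T$.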

Note that, in principle, the Tsallis entropy bias can also serve to identify the
parameters of some other estimators, e.g., the multiplicative parameter of the Good-Turing
estimator. We omit the computation details here.



5 Evaluation Criteria 

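The performance of a density estimation method can be directly evaluated by measuring the similarity between its solution and the underlying real distribution. As mentioned in Formula (15), JS-divergence can be considered as a candidate criterion. In addition, we use the expected log loss (the expected negative normalized log likelihood) [Dudik et al., 2007] as another similarity criterion. The log loss of a resulting distribution $p^{(m)} = (p_1, \ldots, p_m)$ with respect to the sample $x = (x_1, x_2, \ldots, x_m)$ is defined as:

$$L_{p^{(m)}}(x) = -\log \left( p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m} \right) = -\sum_{i=1}^{m} x_i \log p_i \qquad (18)$$

Recall that, given an underlying real m-nomial distribution $\bar{p}^{(m)} = (\bar{p}_1, \ldots, \bar{p}_m)$, the occurrence probability of a sample $x = (x_1, x_2, \ldots, x_m)$ of size $n$ can be expressed by

$$\Pr\left[ x \mid \bar{p}^{(m)} \right] = \frac{n!}{x_1! \, x_2! \cdots x_m!} \, \bar{p}_1^{x_1} \bar{p}_2^{x_2} \cdots \bar{p}_m^{x_m} \qquad (19)$$

The expected log loss of $p^{(m)}$ w.r.t. $\bar{p}^{(m)}$ is then

$$E_L\left[ p^{(m)} \mid \bar{p}^{(m)} \right] = \sum_{x \in X} L_{p^{(m)}}(x) \, \Pr\left[ x \mid \bar{p}^{(m)} \right] \qquad (20)$$

where $X$ stands for the set of all possible samples of size $n$. Substituting Formulae (18) and (19) into the r.h.s. of Formula (20), it can be checked that

$$\begin{aligned}
E_L\left[ p^{(m)} \mid \bar{p}^{(m)} \right]
&= -\sum_{x \in X} \sum_{i=1}^{m} x_i \log p_i \cdot \frac{n!}{x_1! \cdots x_m!} \, \bar{p}_1^{x_1} \cdots \bar{p}_m^{x_m} \\
&= -\sum_{i=1}^{m} \log p_i \sum_{x \in X} x_i \, \frac{n!}{x_1! \cdots x_m!} \, \bar{p}_1^{x_1} \cdots \bar{p}_m^{x_m} \\
&= -\sum_{i=1}^{m} \log p_i \, (n \bar{p}_i) \\
&= -n \sum_{i=1}^{m} \bar{p}_i \log p_i
\end{aligned} \qquad (21)$$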
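The closed form in Formula (21) can be checked numerically against the defining sum in Formula (20). The sketch below does this by brute-force enumeration for a small $m$ and $n$. The function names are hypothetical and the natural logarithm is an assumption, since the base is not fixed in this section.

```python
import math
from itertools import product


def log_loss(p, x):
    """Formula (18): L_p(x) = -sum_i x_i * log(p_i); zero counts contribute nothing."""
    return -sum(xi * math.log(pi) for xi, pi in zip(x, p) if xi > 0)


def multinomial_pmf(x, p_real):
    """Formula (19): probability of the count vector x under p_real."""
    coef = math.factorial(sum(x))
    for xi in x:
        coef //= math.factorial(xi)   # exact: each partial quotient is an integer
    prob = float(coef)
    for xi, pi in zip(x, p_real):
        prob *= pi ** xi
    return prob


def expected_log_loss_enum(p, p_real, n):
    """Formula (20): sum over every count vector of size n (small m, n only)."""
    total = 0.0
    for x in product(range(n + 1), repeat=len(p)):
        if sum(x) == n:
            total += log_loss(p, x) * multinomial_pmf(x, p_real)
    return total


def expected_log_loss_closed(p, p_real, n):
    """Formula (21): -n * sum_i p_real_i * log(p_i)."""
    return -n * sum(q * math.log(pi) for q, pi in zip(p_real, p))
```

With `p = [0.5, 0.3, 0.2]`, `p_real = [0.6, 0.3, 0.1]` and `n = 4`, the two functions agree up to floating-point error. The enumeration grows exponentially in $m$, so in practice only the closed form is useful.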

To measure the performance of different algorithms more systematically w.r.t. a specific evaluation criterion, we introduce the Performance Score (PS) as below:

$$PS_D^C(A) = \frac{C_D(A) - C_D(\mathrm{Worst})}{C_D(\mathrm{Best}) - C_D(\mathrm{Worst})} \qquad (22)$$

where $A$ stands for an algorithm under evaluation, $C$ represents an evaluation criterion and $D$ denotes a specific dataset; the performance value $C_D(A)$ is evaluated by criterion $C$ for algorithm $A$ running on dataset $D$, and $C_D(\mathrm{Best})$ and $C_D(\mathrm{Worst})$ are the performance values of the best and worst algorithms on $D$, respectively.
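As a small illustration of Formula (22), the following sketch maps raw criterion values to performance scores. It assumes a criterion for which smaller values are better (as is the case for both JS-divergence and expected log loss), so the best value is the minimum and the worst is the maximum. The function names and the numbers in the usage note are illustrative only.

```python
def performance_score(value, best, worst):
    """Formula (22): 1.0 for the best algorithm, 0.0 for the worst."""
    if best == worst:
        raise ValueError("best and worst values coincide; PS is undefined")
    return (value - worst) / (best - worst)


def scores_for_dataset(values):
    """Map {algorithm: criterion value} to {algorithm: PS}.

    Assumes a lower-is-better criterion, so best = min and worst = max.
    """
    best, worst = min(values.values()), max(values.values())
    return {name: performance_score(v, best, worst) for name, v in values.items()}
```

For example, with a best value of 0.0120 and a worst value of 0.0262, an algorithm scoring 0.0191 lies halfway between the two and receives a PS of about 0.5.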



6 Experiments 

In this section, we construct three sets of experiments (see footnote 5). First, if reliable prior information is available, in the Maxent framework, the performance of TEBC Maxents will be evaluated in comparison with standard Maxent, GME (an $\ell_1$-regularized Maxent which has a PAC guarantee of performance) [Dudik et al., 2007], as well as Maxents based on Shannon entropy bias (SEB) [Miller, 1955], which are described in Section 6.2.1. Second, in case reliable prior information is not available, we will verify the effectiveness of TEB-Lidstone by comparing its performance with comparative Lidstone and Good-Turing estimators, together with SEB-Lidstone, which is derived from the above SEB in a similar way to TEB-Lidstone (described in Section 6.3.1). In all the above experimental settings, both synthesized and real-world datasets are employed. For TEBs, Bayesian-TEB is calculated by Formula (12), and Frequentist-TEB is estimated by the naive estimator given in Formula (10), which is more efficient in large-scale experiments and also gives satisfying performance in both the Maxent and Lidstone frameworks.



6.1 Datasets Description 

First, synthesized probability distributions are generated to serve as the underlying real m-nomial distributions $\bar{p}^{(m)}$. To this end, we adopt a simple Monte Carlo method to randomly draw $m$ positive points from a source distribution, and then normalize them to form an underlying real distribution. The source distributions we used include the uniform distribution $U(0, 1)$, the absolute value of the standard normal distribution $|N(0, 1)|$, the normal distribution $N(3, 1^2)$, the $\chi^2$ distribution $\chi^2(10)$, the binomial distribution $B(30, 0.2)$ and the beta distribution $\beta(3, 6)$. After the underlying real distribution is generated, a sample of size $n$ is drawn from it, which is then used to calculate the sampling distribution $\hat{p}^{(m)}$.
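The generation procedure described above can be sketched with the Python standard library alone. This is not the authors' released code: the names are hypothetical, non-positive draws are simply redrawn (an assumption about how "positive points" are obtained), and the $\chi^2(10)$ variate is produced as a Gamma variate with shape 5 and scale 2.

```python
import random

# Source distributions named in the text, as draw functions of an RNG.
SOURCES = {
    "U(0,1)": lambda rng: rng.random(),
    "|N(0,1)|": lambda rng: abs(rng.gauss(0.0, 1.0)),
    "N(3,1^2)": lambda rng: rng.gauss(3.0, 1.0),
    "chi2(10)": lambda rng: rng.gammavariate(5.0, 2.0),  # chi2(k) = Gamma(k/2, scale 2)
    "B(30,0.2)": lambda rng: sum(rng.random() < 0.2 for _ in range(30)),
    "beta(3,6)": lambda rng: rng.betavariate(3.0, 6.0),
}


def underlying_distribution(source, m, rng):
    """Draw m positive points from the source and normalize them."""
    points = []
    while len(points) < m:
        v = SOURCES[source](rng)
        if v > 0:                      # redraw non-positive values (assumption)
            points.append(float(v))
    total = sum(points)
    return [v / total for v in points]


def sampling_distribution(p_real, n, rng):
    """Draw a sample of size n from p_real; return (counts, relative frequencies)."""
    m = len(p_real)
    counts = [0] * m
    for i in rng.choices(range(m), weights=p_real, k=n):
        counts[i] += 1
    return counts, [c / n for c in counts]
```

A call such as `underlying_distribution("beta(3,6)", 100, random.Random(0))` followed by `sampling_distribution(p_real, 1000, rng)` mirrors a setting with $m = 100$ bins and $n = 10 \cdot m$ samples.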

We also adopt four real-world datasets, UCI-Dexter (footnote 6), UCI-Statlog (footnote 7), UCI-ISOLET (footnote 8) and UCI-Sonar (footnote 9), in order to generate the underlying real distributions.

Text dataset: Dexter 

UCI-Dexter is a text dataset containing 2000+300 documents. Each document is represented as a 20000-term count vector. Before the text dataset is actually employed, it is preprocessed by dropping the terms that occur too frequently, i.e., the "stop words". After the preprocessing step, $m$ terms are randomly selected from the whole term set, and the frequencies of these selected terms are considered as the underlying real distribution $\bar{p}^{(m)}$. Then, we randomly choose a bag (of size $n$) of words from all the documents as a sample. The frequencies of these terms in the sample are calculated and considered as the sampling distribution $\hat{p}^{(m)}$.

5 The source code is available at http://www.comp.rgu.ac.uk/staff/pz/TEBC/TEB_Experiment_Codes.zip

6 http://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/

7 http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/satimage/

8 http://archive.ics.uci.edu/ml/machine-learning-databases/isolet/

9 http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/

Non-text datasets: Statlog, ISOLET, Sonar 

The UCI-Statlog (Landsat Satellite) dataset consists of all possible 3 × 3 neighborhoods in an 82 × 100 pixel sub-area of a single scene, which is represented by four digital images in different spectral bands. A sample is then defined as the pixel values of each 3 × 3 neighborhood in the four spectral bands (hence 4 × 9 = 36 features in total). The size of the dataset is 4435+2000.

The UCI-ISOLET dataset includes 150 subjects speaking the name of each letter of the alphabet twice. The speakers are divided into groups of 30 speakers. There are 617 real-valued features, including spectral coefficients, contour features, sonorant features, pre-sonorant features, and post-sonorant features, but in an unknown order.

UCI-Sonar contains 111 patterns obtained by bouncing sonar signals off a metal 
cylinder at various angles and under various conditions, and 97 patterns obtained from 
rocks under similar conditions. The transmitted sonar signal is a frequency-modulated 
chirp, rising in frequency. The data set contains signals obtained from a variety of 
different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the 
rock. Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents 
the energy within a particular frequency band, integrated over a certain period of time. 

For the above three non-text datasets, in order to generate the m-nomial underlying real distribution, we simply partition a randomly selected feature into $m$ intervals covering the whole range of this single feature. Then the number of instances in each interval is counted, and finally the underlying real distribution $\bar{p}^{(m)}$ is formed by normalizing the count vector. The sampling process is to first randomly choose $n$ instances and distribute them into the corresponding intervals based on their feature value, and then form the sampling distribution $\hat{p}^{(m)}$ by normalizing this sampling count vector.
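The binning procedure for the non-text datasets can be sketched as follows. Two points are assumptions rather than statements of the source: the $m$ intervals are taken to be of equal width, and the $n$ instances are drawn with replacement (the text does not say, but $n$ may exceed the number of instances in a small dataset). All names are hypothetical.

```python
import random


def bin_index(value, lo, hi, m):
    """Index of the equal-width interval of [lo, hi] that contains value."""
    if hi == lo:
        return 0
    i = int((value - lo) / (hi - lo) * m)
    return min(i, m - 1)               # the maximum value falls in the last bin


def histogram_distribution(values, m, lo, hi):
    """Count values per interval and normalize the count vector."""
    counts = [0] * m
    for v in values:
        counts[bin_index(v, lo, hi, m)] += 1
    return counts, [c / len(values) for c in counts]


def real_and_sampling(feature_values, m, n, rng):
    """Return (p_real, sample counts, p_hat) for one selected feature."""
    lo, hi = min(feature_values), max(feature_values)
    _, p_real = histogram_distribution(feature_values, m, lo, hi)
    sample = rng.choices(feature_values, k=n)   # with replacement (assumption)
    counts, p_hat = histogram_distribution(sample, m, lo, hi)
    return p_real, counts, p_hat
```

With equal-width intervals some bins of the real distribution may be empty, which is why the bin number has to be chosen with the dataset scale in mind.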



6.2 TEBC Maxents vs. Comparative Maxents 

Maxent is widely used due to its effective use of reliable prior information. In this set 
of experiments where the reliable prior information is given, TEBC Maxents and other 
Maxents are compared in terms of their density estimation performance. 



6.2.1 Maxents 

Various forms of Maxents are tested, including Frequentist TEBC Maxents (F-$\ell_2^2$-TEBC, F-ML-TEBC and F-JSD-TEBC), Bayesian TEBC Maxents (B-$\ell_2^2$-TEBC, B-ML-TEBC and B-JSD-TEBC), standard Maxent (SME for short), and GME [Dudik et al., 2007] (implemented by $\ell_1$-SUMMET and $\ell_1$-PLUMMET, which stand for the selective-update and parallel-update algorithms for $\ell_1$-regularized Maxent, respectively). In addition, we also construct three Maxents based on Shannon entropy bias (SEB) [Miller, 1955]. SEB Maxents are similar to TEBC Maxents (Models 1-3) except that the TEB constraint is replaced by the following SEB constraint:

$$S\left[ p^{(m)} \right] \ge S\left[ \hat{p}^{(m)} \right] + \Delta S \qquad (23)$$

where $S[\cdot]$ denotes the Shannon entropy of a probability distribution and $\Delta S = (m - 1)/(2n)$ denotes the SEB (see footnote 10). We use $(m - 1)/(2n)$ as the correction since it is simple in form and frequently used. Note that it cannot be considered an unbiased correction in a strict sense [Paninski, 2003]. By substituting the SEB constraint for the TEB constraint in Models 1-3, three new Maxents are introduced in the experiment, namely $\ell_2^2$-SEB, ML-SEB and JSD-SEB.
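For concreteness, the SEB correction and the feasibility test implied by the constraint in Formula (23) can be written as below. The sketch assumes entropies measured in nats (natural logarithm), which is the unit in which the $(m - 1)/(2n)$ correction is usually stated; the function names are hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability bins contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def seb_correction(m, n):
    """Shannon entropy bias used in the text: (m - 1) / (2n)."""
    return (m - 1) / (2.0 * n)


def satisfies_seb_constraint(p, p_hat, n, tol=1e-12):
    """Check S[p] >= S[p_hat] + (m - 1) / (2n), up to a small tolerance."""
    bound = shannon_entropy(p_hat) + seb_correction(len(p_hat), n)
    return shannon_entropy(p) >= bound - tol
```

With $m = 100$ and $n = 1000$ the correction is 0.0495 nats. The sampling distribution itself never satisfies the constraint when $m > 1$, which is what forces the solution towards higher entropy.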

Note that in the following, we adopt "Sample" to represent the method using the 
sampling distribution as the resulting distribution directly. 



6.2.2 Certain and Uncertain Constraints 



Two kinds of constraints are involved: certain constraints and uncertain constraints. Certain constraints are derived from reliable prior information, which is incomplete information about the underlying real distribution. Specifically, certain constraints can be represented by a set of constraints as follows:

$$\sum_{i \in S_c} p_i = a_c = \sum_{i \in S_c} \bar{p}_i, \qquad \forall c \in C \qquad (24)$$

where $S_c$ is a subset of $\{1, 2, \ldots, m\}$. In order to form this subset, we randomly choose a number $|S_c|$ from $\{1, 2, \ldots, m - 1\}$, and then randomly select $|S_c|$ indexes from $\{1, 2, \ldots, m\}$. Repeating this step $k$ times yields $k$ certain constraints, which together form the set $C$ of all certain constraints.

Uncertain constraints are derived from sampling information, together with empirical threshold parameters that control the similarity between the resulting distribution and the sampling distribution. Specifically, for every $i$, if $\hat{p}_i > th$, then we construct an uncertain constraint represented by a box constraint:

$$\left| p_i - \hat{p}_i \right| \le \delta \qquad (25)$$

where $\delta$ and $th$ are threshold parameters. In our experiments, we fix $th$ as $0.2/m$, and adjust $\delta$ to find relatively optimal performance for standard Maxent and GME.
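The construction of both kinds of constraints can be sketched as follows. The representation (a list of `(index subset, target sum)` pairs for certain constraints, and a dictionary of intervals for box constraints) and the function names are choices made for this illustration only; indexes are 0-based here, whereas the text uses 1-based indexes.

```python
import random


def make_certain_constraints(p_real, k, rng):
    """Build k constraints (S_c, a_c) with a_c = sum of p_real over S_c."""
    m = len(p_real)
    constraints = []
    for _ in range(k):
        size = rng.randint(1, m - 1)                  # |S_c| from {1, ..., m-1}
        subset = sorted(rng.sample(range(m), size))   # |S_c| distinct indexes
        constraints.append((subset, sum(p_real[i] for i in subset)))
    return constraints


def make_box_constraints(p_hat, delta, th):
    """For every i with p_hat_i > th, constrain p_i to [p_hat_i - delta, p_hat_i + delta]."""
    return {i: (ph - delta, ph + delta) for i, ph in enumerate(p_hat) if ph > th}


def satisfies_certain(p, constraints, tol=1e-9):
    """Check that p reproduces every target sum a_c within a tolerance."""
    return all(abs(sum(p[i] for i in subset) - a_c) <= tol
               for subset, a_c in constraints)
```

By construction the underlying real distribution satisfies all certain constraints, while the box constraints depend only on the sample and on the two thresholds.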



6.2.3 Parametric Configuration 



For all the models, the same parameters are used, including generating times, bin number, sample size, and sampling times. Generating times is the number of times the underlying real distribution is generated. Sampling times is the number of times a sampling distribution is drawn from a given underlying real distribution. Note that, in order to avoid zero probability for any bin in the m-nomial underlying real distribution, the bin number $m$ should be set properly for each real-world dataset in terms of its scale. For example, the Sonar dataset has a relatively small number of data points. Therefore, we let $m_{Sonar} = 30$ to avoid a degenerate m-nomial underlying real distribution.

10 In the original formula, an estimate of m is used instead of the real one. In our context, m is known in advance and hence the estimation can be avoided.

All Maxents involve the same certain constraints. TEBC Maxents and SEB Maxents adopt the corresponding TEB and SEB constraints, while the other three Maxents (SME, $\ell_1$-SUMMET and $\ell_1$-PLUMMET) use uncertain constraints. For uncertain constraints, we choose $\delta$ in Formula (25) from [1e-4, 1.6e-3] with increment 1e-4. The performance of TEBC Maxents will be compared with the performance of the comparative Maxents under the optimal $\delta$, i.e., the value under which the comparative Maxents obtain their best performance among all the tested values of $\delta$. The parametric configuration is listed in Table 1.



| Category | Detailed Configurations |
|---|---|
| Generating Times | $r = 10$ |
| #Bin | $m_{Syn}, m_{Dexter}, m_{ISOLET} = 100$; $m_{SAT} = 50$, $m_{Sonar} = 30$ |
| Sample Size | $n = 10 \cdot m$ |
| Sampling Times | $s = 20$ |
| #Certain Constraints | $k = 0.2 \times m$, $k = 0.05 \times m$ |
| Threshold parameter | $\delta$ = 6e-4, chosen from [1e-4, 1.6e-3] |

Table 1 Parametric Configuration in Maxent Framework



6.2.4 Results 

We employ the parametric configuration in Table 1 to run every Maxent. The mean performance scores w.r.t. JS-Divergence and Expected Log Loss, averaged on $r \times s$ sampling distributions, are summarized in Tables 2, 3, 4 and 5. In addition to the performance scores, we also give the best and worst values w.r.t. the two evaluation criteria. Finally, the overall performance of each algorithm, averaged over all synthesized and all real-world datasets, is shown in Table 6 (JS-divergence) and Table 7 (expected log loss).

When $k = 0.2 \times m$, in the experimental results on synthesized datasets, all TEBC Maxents outperform the comparative Maxents in most cases. From Table 6 we can observe that, on average, F-ML-TEBC and B-ML-TEBC are the best two models among TEBC Maxents. The superiority of TEBC Maxents is clearly shown in Tables 2 and 6 under the JS-divergence measure. This superiority also holds under the expected log loss measure, as demonstrated in Tables 3 and 7. As for real-world datasets, we can draw the same conclusion from Tables 4 and 5 that TEBC Maxents show their advantages over the others. As with the results on synthesized datasets, F-ML-TEBC and B-ML-TEBC remain robust and are the best two models on average, as shown in Tables 6 and 7.

When $k = 0.05 \times m$, the amount of prior information is reduced. In this case, it is also clearly demonstrated that all TEBC Maxents outperform the comparative Maxents on average. In particular, we can observe that F-$\ell_2^2$-TEBC and B-$\ell_2^2$-TEBC become the most stable and effective, somewhat differing from their performance in the case where $k = 0.2 \times m$.

We would like to mention that in our experiments, standard Maxent and $\ell_1$-PLUMMET did not perform well. In many cases, using the sampling distribution directly (denoted as Sample) can outperform these two Maxents, especially in the experiments with real-world datasets. One of the causes is that fixing a unified $\delta$ for all the uncertain constraints often leads to a dilemma: if a small $\delta$ is used, the feasible region might be far from the underlying real distribution, and hence the optimization procedure can only converge to a poor solution; if a large $\delta$ is used, there is a great chance of underfitting the sample. Compared with standard Maxent and $\ell_1$-PLUMMET, $\ell_1$-SUMMET (using the selective-update strategy) is relatively stable. However, $\ell_1$-SUMMET with any unified $\delta$ cannot perform as well as TEBC Maxents. Note that there is no practical guidance for determining a different $\delta$ for each uncertain constraint. Even if such guidance existed, finding the optimal $\delta$ for each uncertain constraint would often be a prohibitively complicated task. This reflects the implementation complexity of the existing Maxents and highlights the advantage of the parameter-free characteristic of TEBC Maxents. This view is also supported by the experimental results of SEB Maxents, which achieve better performance than SME and GME on average, although they are still less effective than TEBC Maxents.

| Method | $U(0,1)$ | $\lvert N(0,1^2) \rvert$ | $N(3,1^2)$ |
|---|---|---|---|
| Sample | 0.5983/0.8088 | 0.7005/0.8838 | 0.0000/0.0000 |
| F-$\ell_2^2$-TEBC | 0.8639/0.9572 | 0.8695/0.9966 | 0.9793/1.0000 |
| F-JSD-TEBC | 0.9985/0.9999 | 0.9983/0.9992 | 0.9919/0.9863 |
| F-ML-TEBC | 1.0000/1.0000 | 1.0000/1.0000 | 1.0000/0.9940 |
| B-$\ell_2^2$-TEBC | 0.8630/0.9571 | 0.8690/0.9962 | 0.9759/0.9955 |
| B-JSD-TEBC | 0.9978/0.9993 | 0.9981/0.9990 | 0.9875/0.9803 |
| B-ML-TEBC | 0.9992/0.9994 | 0.9997/0.9997 | 0.9955/0.9879 |
| SME | 0.0000/0.0622 | 0.0231/0.0660 | 0.3774/0.6114 |
| $\ell_1$-SUMMET | 0.0017/0.0000 | 0.0000/0.0000 | 0.6376/0.6428 |
| $\ell_1$-PLUMMET | [illegible] | [illegible] | [illegible] |
| $\ell_2^2$-SEB | [illegible] | [illegible] | [illegible] |
| JSD-SEB | [illegible] | [illegible] | [illegible] |
| ML-SEB | [illegible] | [illegible] | [illegible] |
| Best JS Value | 0.0120/0.0143 | 0.0133/0.0156 | 0.0091/0.0113 |
| Worst JS Value | 0.0262/0.0338 | 0.0281/0.0340 | 0.0182/0.0190 |

| Method | $\chi^2(10)$ | $\beta(3,6)$ | $B(30,0.2)$ |
|---|---|---|---|
| Sample | 0.0000/0.0000 | 0.0000/0.0000 | 0.0000/0.0000 |
| F-$\ell_2^2$-TEBC | 1.0000/1.0000 | 1.0000/1.0000 | 1.0000/1.0000 |
| F-JSD-TEBC | 0.8736/0.7449 | 0.9243/0.8088 | 0.9443/0.8759 |
| F-ML-TEBC | 0.8830/0.7560 | 0.9347/0.8190 | 0.9534/0.8859 |
| B-$\ell_2^2$-TEBC | 0.9963/0.9955 | 0.9967/0.9958 | 0.9957/0.9957 |
| B-JSD-TEBC | 0.8704/0.7409 | 0.9212/0.8046 | 0.9396/0.8709 |
| B-ML-TEBC | 0.8798/0.7518 | 0.9314/0.8147 | 0.9486/0.8807 |
| SME | 0.4469/0.7508 | 0.3993/0.5139 | 0.4686/0.7009 |
| $\ell_1$-SUMMET | 0.7528/0.7562 | 0.6077/0.4997 | 0.7484/0.7088 |
| $\ell_1$-PLUMMET | 0.5371/0.7649 | 0.4859/0.5135 | 0.5623/0.7070 |
| $\ell_2^2$-SEB | 0.3620/0.0802 | 0.3635/0.0795 | 0.3299/0.0951 |
| JSD-SEB | 0.4414/0.1373 | 0.4634/0.1414 | 0.3656/0.1217 |
| ML-SEB | 0.4450/0.1373 | 0.4674/0.1419 | 0.3689/0.1218 |
| Best JS Value | 0.0109/0.0128 | 0.0111/0.0129 | 0.0099/0.0113 |
| Worst JS Value | 0.0186/0.0190 | 0.0184/0.0189 | 0.0191/0.0184 |

Results w.r.t. the number of certain constraints $k = 0.2 \times m$ (left) and $k = 0.05 \times m$ (right).

Table 2 Performance Score (w.r.t. JS-divergence) of different Maxents on Synthesized Datasets

| Method | $U(0,1)$ | $\lvert N(0,1^2) \rvert$ | $N(3,1^2)$ |
|---|---|---|---|
| Sample | 0.8539/0.7594 | 0.9179/0.8291 | 0.3802/0.0547 |
| F-$\ell_2^2$-TEBC | 0.9331/0.9871 | 0.9190/1.0000 | 1.0000/1.0000 |
| F-JSD-TEBC | 0.9989/0.9987 | 0.9993/0.9840 | 0.9914/0.9371 |
| F-ML-TEBC | 1.0000/1.0000 | 1.0000/0.9856 | 0.9982/0.9471 |
| B-$\ell_2^2$-TEBC | 0.9293/0.9863 | 0.9191/0.9993 | 0.9977/0.9958 |
| B-JSD-TEBC | 0.9986/0.9977 | 0.9992/0.9836 | 0.9884/0.9313 |
| B-ML-TEBC | 0.9996/0.9989 | 0.9998/0.9851 | 0.9952/0.9413 |
| SME | 0.0000/0.1225 | 0.0000/0.0115 | 0.0000/0.0000 |
| $\ell_1$-SUMMET | 0.7169/0.0719 | 0.7764/0.0000 | 0.8133/0.7731 |
| $\ell_1$-PLUMMET | [illegible] | [illegible] | [illegible] |
| $\ell_2^2$-SEB | [illegible] | [illegible] | [illegible] |
| JSD-SEB | [illegible] | [illegible] | [illegible] |
| ML-SEB | [illegible] | [illegible] | [illegible] |
| Best ELL Value | 6.4128/6.4179 | 6.2935/6.3022 | 6.5928/6.6009 |
| Worst ELL Value | 6.5864/6.4879 | 6.5226/6.3587 | 6.6583/6.6394 |

| Method | $\chi^2(10)$ | $\beta(3,6)$ | $B(30,0.2)$ |
|---|---|---|---|
| Sample | 0.4724/0.0936 | 0.6084/0.0911 | 0.7054/0.0236 |
| F-$\ell_2^2$-TEBC | 1.0000/1.0000 | 1.0000/1.0000 | 1.0000/1.0000 |
| F-JSD-TEBC | 0.9239/0.7299 | 0.9642/0.7823 | 0.9770/0.8411 |
| F-ML-TEBC | 0.9298/0.7408 | 0.9693/0.7930 | 0.9803/0.8529 |
| B-$\ell_2^2$-TEBC | 0.9985/0.9967 | 0.9982/0.9962 | 0.9988/0.9959 |
| B-JSD-TEBC | 0.9221/0.7262 | 0.9629/0.7784 | 0.9755/0.8360 |
| B-ML-TEBC | 0.9279/0.7369 | 0.9679/0.7889 | 0.9788/0.8477 |
| SME | 0.0000/0.0000 | 0.0000/0.6955 | 0.0000/0.7849 |
| $\ell_1$-SUMMET | 0.9061/0.8648 | 0.8935/0.6993 | 0.9371/0.7977 |
| $\ell_1$-PLUMMET | 0.6539/0.8531 | 0.7555/0.7008 | 0.8527/0.7923 |
| $\ell_2^2$-SEB | 0.5335/0.0147 | 0.6291/0.0000 | 0.7546/0.0000 |
| JSD-SEB | 0.6897/0.1989 | 0.7776/0.1984 | 0.8075/0.1278 |
| ML-SEB | 0.6917/0.1990 | 0.7793/0.1989 | 0.8085/0.1279 |
| Best ELL Value | 6.5460/6.5533 | 6.5405/6.5491 | 6.5916/6.5834 |
| Worst ELL Value | 6.6129/6.5873 | 6.6267/6.5820 | 6.7347/6.6178 |

Results w.r.t. the number of certain constraints $k = 0.2 \times m$ (left) and $k = 0.05 \times m$ (right).

Table 3 Performance Scores (w.r.t. Expected Log Loss) of different Maxents on Synthesized Datasets



In summary, TEBC Maxents show their effectiveness and stability, as demonstrated in the result tables, especially in Tables 6 and 7. We would like to stress that TEBC Maxents are also easy to implement, since only certain constraints and a single TEB constraint are involved. This helps to support our previous theoretical justification of TEB. In addition, note that the SEB-based models consistently outperform SME, $\ell_1$-SUMMET and $\ell_1$-PLUMMET, even though the SEB correction is not exact. This observation indicates that the framework of our generalized Maxent proposed in Section 4.2 has gains in itself.



6.3 TEB-Lidstone vs. Comparative Lidstone and Good-Turing Estimators

When reliable prior information is not available, Lidstone and Good-Turing estimators are often used in many applications since they are often effective enough, and


| Method | Dexter | Statlog | ISOLET | Sonar |
|---|---|---|---|---|
| Sample | 0.3852/0.4350 | 0.8212/0.8968 | 0.5107/0.5693 | 0.3323/0.3176 |
| F-$\ell_2^2$-TEBC | 0.9415/1.0000 | 0.8000/0.9010 | 0.8261/0.9997 | 0.9513/1.0000 |
| F-JSD-TEBC | 0.9955/0.7596 | 0.9996/0.9995 | 0.9966/0.9565 | 0.9956/0.8274 |
| F-ML-TEBC | 1.0000/0.7631 | 0.9999/1.0000 | 1.0000/0.9598 | 1.0000/0.8328 |
| B-$\ell_2^2$-TEBC | 0.9395/0.9983 | 0.7998/0.9008 | 0.8250/1.0000 | 0.9476/0.9948 |
| B-JSD-TEBC | 0.9952/0.7591 | 0.9996/0.9995 | 0.9951/0.9550 | 0.9942/0.8239 |
| B-ML-TEBC | 0.9996/0.7624 | 1.0000/0.9999 | 0.9984/0.9582 | 0.9984/0.8289 |
| SME | 0.0000/0.0697 | 0.0000/0.0065 | 0.0000/0.0691 | 0.0000/0.0737 |
| $\ell_1$-SUMMET | 0.3694/0.0000 | 0.2099/0.0000 | 0.1384/0.0000 | 0.0962/0.0000 |
| $\ell_1$-PLUMMET | 0.1956/0.0819 | 0.2588/0.1281 | 0.0867/0.0842 | 0.1148/0.1567 |
| $\ell_2^2$-SEB | 0.5313/0.4094 | 0.7533/0.7945 | 0.5821/0.5497 | 0.5605/0.3684 |
| JSD-SEB | 0.8874/0.5702 | 0.9853/0.9716 | 0.8675/0.6704 | 0.8117/0.5095 |
| ML-SEB | 0.8910/0.5709 | 0.9863/0.9719 | 0.8704/0.6709 | 0.8144/0.5092 |
| Best JS Value | 0.0144/0.0152 | 0.0132/0.0148 | 0.0132/0.0145 | 0.0134/0.0142 |
| Worst JS Value | 0.0213/0.0213 | 0.0337/0.0304 | 0.0227/0.0228 | 0.0201/0.0197 |

Results w.r.t. the number of certain constraints $k = 0.2 \times m$ (left) and $k = 0.05 \times m$ (right).

Table 4 Performance Score (w.r.t. JS-divergence) of different Maxents on Real-world Datasets


| Method | Dexter | Statlog | ISOLET | Sonar |
|---|---|---|---|---|
| Sample | 0.9023/0.4710 | 0.9811/0.9884 | 0.8680/0.7702 | 0.9696/0.9648 |
| F-$\ell_2^2$-TEBC | 0.9463/1.0000 | 0.9351/0.9572 | 0.8621/0.9833 | 0.9838/1.0000 |
| F-JSD-TEBC | 0.9991/0.7678 | 0.9998/0.9999 | 0.9986/0.9974 | 0.9997/0.9954 |
| F-ML-TEBC | 1.0000/0.7714 | 0.9999/1.0000 | 1.0000/1.0000 | 1.0000/0.9959 |
| B-$\ell_2^2$-TEBC | 0.9451/0.9970 | 0.9342/0.9569 | 0.8618/0.9919 | 0.9823/0.9992 |
| B-JSD-TEBC | 0.9990/0.7671 | 0.9998/0.9998 | 0.9981/0.9963 | 0.9995/0.9952 |
| B-ML-TEBC | 0.9998/0.7706 | 1.0000/0.9999 | 0.9995/0.9988 | 0.9998/0.9955 |
| SME | 0.0000/0.5313 | 0.5737/0.4980 | 0.0000/0.5079 | 0.7454/0.5920 |
| $\ell_1$-SUMMET | 0.9470/0.4908 | 0.8644/0.6724 | 0.8252/0.4840 | 0.8715/0.8095 |
| $\ell_1$-PLUMMET | 0.8330/0.5286 | 0.0000/0.0000 | 0.6469/0.0000 | 0.0000/0.0000 |
| $\ell_2^2$-SEB | 0.7848/0.0000 | 0.9028/0.8834 | 0.7169/0.5678 | 0.9333/0.9203 |
| JSD-SEB | 0.9755/0.5619 | 0.9962/0.9920 | 0.9557/0.8100 | 0.9889/0.9740 |
| ML-SEB | 0.9761/0.5626 | 0.9964/0.9920 | 0.9567/0.8103 | 0.9891/0.9740 |
| Best ELL Value | 6.3275/6.3600 | 5.0336/5.0862 | 6.0809/6.2262 | 4.6626/4.6838 |
| Worst ELL Value | 6.5260/6.3918 | 5.7561/5.4920 | 6.2318/6.2896 | 5.3435/5.1230 |

Results w.r.t. the number of certain constraints $k = 0.2 \times m$ (left) and $k = 0.05 \times m$ (right).

Table 5 Performance Scores (w.r.t. Expected Log Loss) of different Maxents on Real-world Datasets



more efficient than Maxent. This set of experiments is constructed to verify the advantage of TEB-Lidstone (F-Lidstone and B-Lidstone) over the involved Lidstone and Good-Turing estimators.

6.3.1 Lidstone and Good-Turing Estimators 

In this subsection, we evaluate the effectiveness of the F-Lidstone and B-Lidstone estimators, which have been described in Section 4.3. Other Lidstone models, namely the Laplace estimator (Laplace) and the Expected Likelihood Estimator (ELE), serve as comparative algorithms. Motivated by TEB-Lidstone, we also derive a Lidstone estimator from SEB in a similar way to TEB-Lidstone. To identify the rate of probability correction $f$, the SEB constraint (Formula (23)) is used as guidance instead of the TEB constraint, so that the Shannon entropy of the resulting distribution $S\left[ p^{(m)} \right]$ is closest to the sum of the sampling Shannon entropy $S\left[ \hat{p}^{(m)} \right]$ and the Shannon entropy bias $\Delta S$. The resulting Lidstone estimator is referred to as SEB-Lidstone.

In addition, some Good-Turing estimators, namely the simplest Good-Turing estimator (SimplestGT), Simple Good-Turing (SGT) [Gale and Sampson, 1995] and a low-complexity diminishing attenuation estimator (LC-DAE) with some asymptotic guarantee of performance [Orlitsky et al., 2003], are also involved in the comparative experiments. Because Good-Turing estimators do not assume that the bin number $m$ is given, they only assign a total probability to all the zero bins w.r.t. the sample. In our case, in which $m$ is provided, this total is uniformly distributed over all these zero bins.

| Algorithms | Synthesized | Real-world | Average |
|---|---|---|---|
| Sample | 0.2164/0.2821 | 0.5123/0.5547 | 0.3644/0.4184 |
| F-$\ell_2^2$-TEBC | 0.9521/0.9923 | 0.8797/0.9752 | 0.9159/0.9837 |
| F-JSD-TEBC | 0.9551/0.9025 | 0.9968/0.8858 | 0.9760/0.8941 |
| F-ML-TEBC | 0.9618/0.9091 | 0.9999/0.8889 | 0.9809/0.8990 |
| B-$\ell_2^2$-TEBC | 0.9495/0.9893 | 0.8780/0.9735 | 0.9137/0.9814 |
| B-JSD-TEBC | 0.9524/0.8992 | 0.9960/0.8844 | 0.9742/0.8918 |
| B-ML-TEBC | 0.9590/0.9057 | 0.9991/0.8874 | 0.9791/0.8965 |
| SME | 0.2859/0.4508 | 0.0000/0.0547 | 0.1429/0.2528 |
| $\ell_1$-SUMMET | 0.4580/0.4346 | 0.2035/0.0000 | 0.3307/0.2173 |
| $\ell_1$-PLUMMET | 0.3634/0.4549 | 0.1640/0.1127 | 0.2637/0.2838 |
| $\ell_2^2$-SEB | 0.4658/0.3313 | 0.6068/0.5305 | 0.5363/0.4309 |
| JSD-SEB | 0.5712/0.3828 | 0.8879/0.6804 | 0.7296/0.5316 |
| ML-SEB | 0.5746/0.3830 | 0.8905/0.6807 | 0.7326/0.5319 |

Results w.r.t. the number of certain constraints $k = 0.2 \times m$ / $k = 0.05 \times m$.

Table 6 Overall Performance Score evaluated by JS-Divergence for Experiment Results in Section 6.2

| Algorithms | Synthesized | Real-world | Average |
|---|---|---|---|
| Sample | 0.6564/0.3086 | 0.9302/0.7986 | 0.7933/0.5536 |
| F-$\ell_2^2$-TEBC | 0.9753/0.9978 | 0.9318/0.9851 | 0.9536/0.9915 |
| F-JSD-TEBC | 0.9758/0.8789 | 0.9993/0.9401 | 0.9875/0.9095 |
| F-ML-TEBC | 0.9796/0.8866 | 0.9999/0.9418 | 0.9898/0.9142 |
| B-$\ell_2^2$-TEBC | 0.9736/0.9950 | 0.9309/0.9862 | 0.9522/0.9906 |
| B-JSD-TEBC | 0.9744/0.8756 | 0.9991/0.9396 | 0.9868/0.9076 |
| B-ML-TEBC | 0.9782/0.8831 | 0.9998/0.9412 | 0.9890/0.9122 |
| SME | 0.0000/0.2690 | 0.3298/0.5323 | 0.1649/0.4007 |
| $\ell_1$-SUMMET | 0.8406/0.5345 | 0.8770/0.6142 | 0.8588/0.5743 |
| $\ell_1$-PLUMMET | 0.6783/0.5199 | 0.3699/0.1321 | 0.5241/0.3260 |
| $\ell_2^2$-SEB | 0.6618/0.1783 | 0.8345/0.5929 | 0.7481/0.3856 |
| JSD-SEB | 0.7988/0.3868 | 0.9791/0.8345 | 0.8889/0.6106 |
| ML-SEB | 0.8004/0.3871 | 0.9796/0.8347 | 0.8900/0.6109 |

Results w.r.t. the number of certain constraints $k = 0.2 \times m$ / $k = 0.05 \times m$.

Table 7 Overall Performance Score evaluated by Expected Log Loss for Experiment Results in Section 6.2

6.3.2 Results

We employ the same parameter setting as in Table 1, ignoring the parameters in the constraints of Maxent, and then run the F-Lidstone and B-Lidstone estimators as well as the involved comparative models on the synthesized and real-world datasets. The mean performance scores w.r.t. JS-Divergence and Expected Log Loss over generating times $r$ and sampling times $s$ for each dataset are summarized in Tables 8 to 11. The overall performance of each algorithm on synthesized and real-world datasets is also shown in Table 12 and Table 13.



| Method | $U(0,1)$ | $\lvert N(0,1^2) \rvert$ | $N(3,1^2)$ | $\chi^2(10)$ | $\beta(3,6)$ | $B(30,0.2)$ |
|---|---|---|---|---|---|---|
| Sample | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| SimplestGT | 0.0916 | 0.1851 | 0.0062 | 0.0446 | 0.0679 | 0.0219 |
| Laplace | 0.9429 | 1.0000 | 0.4823 | 0.6518 | 0.6521 | 0.5082 |
| ELE | 0.6318 | 0.7745 | 0.2741 | 0.3744 | 0.3785 | 0.2876 |
| SGT | 0.4896 | 0.3537 | 0.3754 | 0.3235 | 0.4152 | 0.3642 |
| LC-DAE | 0.5680 | 0.7053 | 0.2532 | 0.5771 | 0.5202 | 0.4032 |
| B-Lidstone | 0.9999 | 0.9566 | 0.9958 | 0.9954 | 0.9961 | 0.9953 |
| F-Lidstone | 1.0000 | 0.9573 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| SEB-Lidstone | 0.7768 | 0.8065 | 0.5755 | 0.6500 | 0.6363 | 0.5882 |
| Best JS Value | 0.0151 | 0.0154 | 0.0114 | 0.0132 | 0.0131 | 0.0118 |
| Worst JS Value | 0.0181 | 0.0177 | 0.0183 | 0.0187 | 0.0186 | 0.0187 |

Table 8 Performance Score (w.r.t. JS-divergence) of different Lidstone and Good-Turing Estimators on Synthesized Datasets



| Method | $U(0,1)$ | $\lvert N(0,1^2) \rvert$ | $N(3,1^2)$ | $\chi^2(10)$ | $\beta(3,6)$ | $B(30,0.2)$ |
|---|---|---|---|---|---|---|
| Sample | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| SimplestGT | 0.1475 | 0.2419 | 0.0228 | 0.0748 | 0.1074 | 0.0423 |
| Laplace | 0.9042 | 1.0000 | 0.4965 | 0.6732 | 0.6715 | 0.5313 |
| ELE | 0.5798 | 0.6886 | 0.2860 | 0.3967 | 0.3995 | 0.3076 |
| SGT | 0.4857 | 0.3881 | 0.3861 | 0.3487 | 0.4391 | 0.3817 |
| LC-DAE | 0.7220 | 0.8708 | 0.3367 | 0.6309 | 0.5951 | 0.4599 |
| B-Lidstone | 0.9985 | 0.9258 | 0.9957 | 0.9957 | 0.9963 | 0.9956 |
| F-Lidstone | 1.0000 | 0.9273 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| SEB-Lidstone | 0.7260 | 0.7349 | 0.5865 | 0.6689 | 0.6542 | 0.6071 |
| Best ELL Value | 6.4366 | 6.2912 | 6.6035 | 6.5505 | 6.5494 | 6.5945 |
| Worst ELL Value | 6.4539 | 6.3051 | 6.6364 | 6.5781 | 6.5769 | 6.6275 |

Table 9 Performance Scores (w.r.t. Expected Log Loss) of different Lidstone and Good-Turing Estimators on Synthesized Datasets



In the experimental results on synthesized datasets, it can be observed that the F-Lidstone and B-Lidstone estimators, especially F-Lidstone, outperform the other models in most cases. Even in the worst case, they are still more effective than most of the other models. Hence, F-Lidstone and B-Lidstone are on average the best performing among all estimators.

| Method | Dexter | Statlog | ISOLET | Sonar |
|---|---|---|---|---|
| Sample | 0.0000 | 0.4568 | 0.0000 | 0.0000 |
| SimplestGT | 0.2774 | 0.3649 | 0.1241 | 0.1712 |
| Laplace | 1.0000 | 0.4363 | 0.8874 | 0.8967 |
| ELE | 0.6277 | 0.9992 | 0.5832 | 0.5480 |
| SGT | 0.3740 | 0.4704 | 0.3137 | 0.5455 |
| LC-DAE | 0.8305 | 0.0000 | 0.6002 | 0.6556 |
| B-Lidstone | 0.9099 | 1.0000 | 0.9972 | 0.9912 |
| F-Lidstone | 0.9124 | 0.9986 | 1.0000 | 1.0000 |
| SEB-Lidstone | 0.7658 | 0.9204 | 0.7668 | 0.7172 |
| Best JS Value | 0.0148 | 0.0152 | 0.0150 | 0.0144 |
| Worst JS Value | 0.0184 | 0.0171 | 0.0183 | 0.0182 |

Table 10 Performance Score (w.r.t. JS-divergence) of different Lidstone and Good-Turing Estimators on Real-world Datasets



| Method | Dexter | Statlog | ISOLET | Sonar |
|---|---|---|---|---|
| Sample | 0.0000 | 0.1702 | 0.0000 | 0.0000 |
| SimplestGT | 0.3568 | 0.0000 | 0.1795 | 0.2092 |
| Laplace | 1.0000 | 0.8697 | 0.9138 | 0.9062 |
| ELE | 0.6363 | 1.0000 | 0.5827 | 0.5631 |
| SGT | 0.4422 | 0.2000 | 0.3584 | 0.5509 |
| LC-DAE | 0.9297 | 0.3987 | 0.7471 | 0.7839 |
| B-Lidstone | 0.9153 | 0.9195 | 0.9970 | 0.9911 |
| F-Lidstone | 0.9177 | 0.9157 | 1.0000 | 1.0000 |
| SEB-Lidstone | 0.7727 | 0.6445 | 0.7562 | 0.7240 |
| Best ELL Value | 6.3450 | 4.9949 | 6.4027 | 4.7125 |
| Worst ELL Value | 6.3659 | 5.0012 | 6.4207 | 4.7338 |

Table 11 Performance Scores (w.r.t. Expected Log Loss) of different Lidstone and Good-Turing Estimators on Real-world Datasets



| Algorithms | Synthesized | Real-world | Average |
|---|---|---|---|
| SimplestGT | 0.0695 | 0.2344 | 0.1520 |
| Laplace | 0.7062 | 0.8051 | 0.7556 |
| ELE | 0.4535 | 0.6896 | 0.5715 |
| SGT | 0.3869 | 0.4259 | 0.4064 |
| LC-DAE | 0.5045 | 0.5216 | 0.5130 |
| B-Lidstone | 0.9899 | 0.9746 | 0.9822 |
| F-Lidstone | 0.9928 | 0.9777 | 0.9853 |
| SEB-Lidstone | 0.6722 | 0.7925 | 0.7324 |

Table 12 Overall Performance Score evaluated by JS-Divergence for Experiment Results in Section 6.3



For the real-world datasets, we reach the same conclusion: F-Lidstone and B-Lidstone
outperform the other estimators on average. Although Laplace and ELE achieve the
best performance on the Dexter and Statlog datasets, F-Lidstone and B-Lidstone are
only slightly behind there, and their performance on the other datasets, especially
ISOLET, is much better than that of the other estimators. Further, in some cases the
performance scores evaluated by JS-divergence and Expected Log Loss are not consistent
with each other. This is probably because each evaluation criterion is incomplete and
only partially reflects the similarity between the resulting distribution and the
underlying real distribution. In this sense, the more stable an algorithm is across
different similarity criteria, the more effective it can be considered to be.
F-Lidstone and B-Lidstone are also the best in this respect.

Algorithms       Synthesized    Real-world    Average
SimplestGT       0.1061         0.1864        0.1462
Laplace          0.7128         0.9224        0.8176
ELE              0.4431         0.6955        0.5693
SGT              0.4049         0.3879        0.3964
LC-DAE           0.6025         0.7149        0.6587
B-Lidstone       0.9846         0.9557        0.9702
F-Lidstone       0.9878         0.9583        0.9731
SEB-Lidstone     0.6629         0.7243        0.6936

Table 13 Overall Performance Score evaluated by Expected Log Loss for Experiment Results
in Section 6.3
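The disagreement between the two criteria can be reproduced on a toy example. The sketch below assumes the standard definitions (Lidstone estimate (c_i + λ)/(n + mλ), with λ = 1 for Laplace and λ = 0.5 for ELE; JS-divergence in nats; expected log loss as the cross-entropy of the estimate under the real distribution). The distribution `true_p` and the counts are hypothetical and are not taken from the experiments.

```python
import math

def lidstone(counts, lam):
    """Lidstone estimate (c_i + lam) / (n + m * lam)."""
    n, m = sum(counts), len(counts)
    return [(c + lam) / (n + m * lam) for c in counts]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def expected_log_loss(p, q):
    """-sum_i p_i log q_i; infinite if q gives zero mass where p does not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log(qi)
    return total

true_p = [0.5, 0.3, 0.15, 0.05]   # hypothetical underlying distribution
counts = [6, 3, 1, 0]             # hypothetical sample of size 10

for name, lam in [("Sample", 0.0), ("ELE", 0.5), ("Laplace", 1.0)]:
    q = lidstone(counts, lam)
    print(name, round(js_divergence(true_p, q), 4), expected_log_loss(true_p, q))
```

With an unseen category, the unsmoothed sample distribution still has a finite JS-divergence (it is bounded by log 2), while its expected log loss is unbounded. The two criteria can therefore rank estimators differently.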

In summary, when Bayesian-TEB and Frequentist-TEB are applied to the Lidstone
framework, the resulting B-Lidstone and F-Lidstone estimators achieve excellent
performance in the expected sense, compared with common Lidstone and Good-Turing
estimators. Hence, it can be concluded that TEBs do make sense and can benefit the
Lidstone framework.

7 Conclusions and Future Work 

This paper proposes closed-form formulae for the expected Tsallis entropy bias
(TEB) under the Frequentist and Bayesian frameworks. TEBs quantify the difference
between the expected Tsallis entropy of sampling distributions and the Tsallis entropy
of the underlying real distribution. They are exact in the sense of unbiasedness
and consistency, and hence naturally entail a quantitative re-interpretation of the
Maxent principle. In other words, TEBs give a quantitative answer to the question of
why we should choose the distribution with maximum entropy. We further apply
TEBs in the Maxent and Lidstone frameworks, and both show promising results
on synthesized and real-world datasets.
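The entropy reduction can be checked exactly on a small case. The sketch below assumes the Tsallis index q = 2 with the normalization T(p) = 1 - sum_i p_i^2, and enumerates every i.i.d. sample sequence of length n in exact rational arithmetic. The distribution `p` and the sample size are hypothetical. Under these assumptions the expected entropy of the sampling distribution equals (n-1)/n times T(p), so the reduction is T(p)/n.

```python
from fractions import Fraction
from itertools import product

def tsallis2(p):
    """Tsallis entropy with index q = 2: 1 - sum_i p_i^2."""
    return 1 - sum(x * x for x in p)

def expected_sample_entropy(p, n):
    """Exact E[T(p_hat)] over all length-n i.i.d. sequences drawn from p."""
    m = len(p)
    total = Fraction(0)
    for seq in product(range(m), repeat=n):
        prob = Fraction(1)
        for s in seq:
            prob *= p[s]                      # probability of this sequence
        p_hat = [Fraction(seq.count(i), n) for i in range(m)]
        total += prob * tsallis2(p_hat)
    return total

p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # hypothetical real distribution
n = 5                                                    # hypothetical sample size

lhs = expected_sample_entropy(p, n)
rhs = Fraction(n - 1, n) * tsallis2(p)
print(lhs == rhs, tsallis2(p) - lhs)   # equality flag and the entropy reduction
```

Exhaustive enumeration costs m^n terms, so it is only a sanity check for small m and n.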

In using the maximum entropy approach for density estimation, a key challenge lies in
the selection of uncertain constraints: inappropriate choices can easily cause
serious overfitting or underfitting. To deal with this challenge, a family of
TEBC Maxents, namely $L_2$-TEBC, JSD-TEBC and ML-TEBC, is proposed in this
paper. Instead of using uncertain constraints selected empirically, the proposed models
let the Tsallis entropy of the resulting distribution converge to that of the underlying
real distribution by compensating the expected Tsallis entropy bias, while ensuring that
the resulting distribution resembles the sampling distribution w.r.t. the $L_2$ norm,
JS-divergence or Maximum Likelihood. Hence, the resulting distributions are optimal in
the expected sense w.r.t. the above three similarity criteria.

The family of TEBC Maxents is a natural generalization of standard Maxent.
The important difference is that the constraints of TEBC Maxents can be derived
entirely from reliable prior information (certain constraints) or from analysis (the
TEB constraint), whereas standard Maxent also has to involve uncertain constraints
that demand empirical parameter selection. TEBC Maxents are therefore parameter-free
in this sense. Furthermore, the analytically established TEB constraint can be
effective in suppressing overfitting and underfitting, since it forces the resulting
distribution to approach the real one by matching their entropies.

In addition to the Maxent framework, we also demonstrate that there is a natural
connection between TEB and another widely used estimator, the Lidstone estimator.
Specifically, TEB can analytically identify the adaptive rate of probability
correction in the Lidstone framework. As a result, the TEB-Lidstone estimators
(F-Lidstone and B-Lidstone) have been developed.
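To illustrate how an entropy bias could fix the Lidstone rate, the sketch below picks λ so that the smoothed estimate has Tsallis (q = 2) entropy equal to the bias-corrected value n/(n-1) T(p_hat). This is a hypothetical matching rule written for illustration and is not claimed to be the F-Lidstone or B-Lidstone formula. The function name `entropy_matched_lambda` and the counts are assumptions.

```python
def tsallis2(p):
    """Tsallis entropy with index q = 2: 1 - sum_i p_i^2."""
    return 1.0 - sum(x * x for x in p)

def lidstone(counts, lam):
    """Lidstone estimate (c_i + lam) / (n + m * lam)."""
    n, m = sum(counts), len(counts)
    return [(c + lam) / (n + m * lam) for c in counts]

def entropy_matched_lambda(counts):
    """Hypothetical rule: choose lam so that T(lidstone) = n / (n - 1) * T(p_hat)."""
    n, m = sum(counts), len(counts)
    if n < 2:
        raise ValueError("at least two observations are required")
    p_hat = [c / n for c in counts]
    s = sum(x * x for x in p_hat)            # sum of squared frequencies
    t_max = 1.0 - 1.0 / m                    # entropy of the uniform distribution
    target = min(n / (n - 1) * tsallis2(p_hat), t_max)
    if s - 1.0 / m < 1e-12:                  # sample already uniform: nothing to correct
        return 0.0
    # Lidstone = (1 - mu) * p_hat + mu * uniform, with mu = m*lam / (n + m*lam),
    # and its entropy is t_max - (1 - mu)^2 * (s - 1/m).
    shrink = ((t_max - target) / (s - 1.0 / m)) ** 0.5   # equals 1 - mu
    if shrink == 0.0:
        return float("inf")                  # target is the uniform distribution
    mu = 1.0 - shrink
    return n * mu / (m * shrink)

counts = [6, 3, 1, 0]                        # hypothetical sample
lam = entropy_matched_lambda(counts)
print(lam, lidstone(counts, lam))
```

The closed form works because a Lidstone estimate is a convex combination of the sampling distribution and the uniform distribution, so its q = 2 entropy is a quadratic function of the mixing weight.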

In the future, several further theoretical issues are worth considering. Firstly, the TEB
results might be developed for other $q$ indexes of Tsallis entropy. The unbiased
and consistent results w.r.t. $q \to 1$ are of special interest, since Tsallis entropy is
equivalent to Shannon entropy in this case. It can be expected that these extended results
offer more complete criteria to further address overfitting and underfitting. For instance,
as an extreme case, if we can give $m$ independent Frequentist-TEB results, the possibility
of overfitting and underfitting can be, in principle, ruled out, and hence a task of
Trinomial distribution estimation could become determined. However, other numerical
procedures should be devised in order to globally solve the resulting model integrating
TEB constraints w.r.t. $q \neq 2$, since the new TEB constraints could be non-convex.
Secondly, it is also interesting to develop Bayesian-TEB results w.r.t. other Bayesian
priors. Finally, we have observed that the two criteria of estimation quality, i.e.,
JS-divergence and the expected log loss, occasionally give inconsistent evaluation results.
Hence, it would be helpful to develop more sophisticated metrics to evaluate the
performance of density estimation.
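The relation between the two entropies as $q \to 1$ can be seen numerically. The sketch below assumes the usual definition T_q(p) = (1 - sum_i p_i^q) / (q - 1) and the natural-log Shannon entropy. The distribution `p` is hypothetical.

```python
import math

def tsallis(p, q):
    """Tsallis entropy (1 - sum_i p_i^q) / (q - 1), defined for q != 1."""
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def shannon(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.5, 0.3, 0.2]   # hypothetical distribution
for q in (2.0, 1.5, 1.1, 1.01, 1.001):
    print(q, tsallis(p, q), shannon(p))
```

At q = 2 the expression reduces to the quadratic form 1 - sum_i p_i^2, and as q approaches 1 the printed Tsallis values approach the Shannon value.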
Appendix A. Standard Deviation of $E_p^{(n)}(T)$

Proposition 3 Given the uniform probability metric over $p^{(m)}$, let
$\mathrm{STD}_n^{(m)}(T)$ denote the standard deviation of $E_p^{(n)}(T)$. Then

$$\mathrm{STD}_n^{(m)}(T) = \frac{2}{\sqrt{(m-1)\cdot(m+2)\cdot(m+3)}} \cdot E_n^{(m)}(T) \tag{26}$$

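Proof Combining the definition of standard deviation with the integral operator
$L^{(m-1)}$ defined by Formula (7), we have:

$$\mathrm{STD}_n^{(m)}(T) = \sqrt{\frac{1}{L^{(m-1)}(1)} \cdot L^{(m-1)}\left\{\left[E_p^{(n)}(T) - E_n^{(m)}(T)\right]^2\right\}}$$

which could be further transformed into

$$\mathrm{STD}_n^{(m)}(T) = \sqrt{\frac{1}{L^{(m-1)}(1)} \cdot L^{(m-1)}\left\{\left[E_p^{(n)}(T)\right]^2\right\} - \left[E_n^{(m)}(T)\right]^2} \tag{27}$$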



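Recall that $E_p^{(n)}(T)$ has been re-expressed in an earlier formula, and hence it can be
verified that

$$\left[E_p^{(n)}(T)\right]^2 = \frac{4(n-1)^2}{n^2}\left(\sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-2}\sum_{j=i+1}^{m-1} p_i \cdot p_j\right)^2$$

Let $U^{(m-1)}$ denote the set of all product terms in
$\left(\sum_{i=1}^{m-1} p_i - \sum_{i=1}^{m-1} p_i^2 - \sum_{i=1}^{m-2}\sum_{j=i+1}^{m-1} p_i \cdot p_j\right)^2$.
Then the calculation of $L^{(m-1)}\left\{\left[E_p^{(n)}(T)\right]^2\right\}$ is reduced to solving
the closed form of $U^{(m-1)}$ w.r.t. the integral operator $L^{(m-1)}$.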

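Due to the symmetry of the integral domain, the properties of $L^{(m-1)}$ established in
the earlier proposition can be generalized as follows:

(a) $L^{(m-1)}(p_i^r) = L^{(m-1)}(p_j^r)$, $1 \le i, j \le m-1$, $r \in \{2, 3, 4\}$

(b) $L^{(m-1)}(p_i \cdot p_j^r) = L^{(m-1)}(p_k \cdot p_l^r)$, $1 \le i, j, k, l \le m-1$, $i \ne j$, $k \ne l$, $r \in \{1, 2, 3\}$

(c) $L^{(m-1)}(p_i^2 \cdot p_j^2) = L^{(m-1)}(p_k^2 \cdot p_l^2)$, $1 \le i, j, k, l \le m-1$, $i \ne j$, $k \ne l$

(d) $L^{(m-1)}(p_i \cdot p_j \cdot p_k^r) = L^{(m-1)}(p_u \cdot p_v \cdot p_w^r)$, $1 \le i, j, k, u, v, w \le m-1$, with $i, j, k$ pairwise distinct and $u, v, w$ pairwise distinct, $r \in \{1, 2\}$

(e) $L^{(m-1)}(p_i \cdot p_j \cdot p_k \cdot p_l) = L^{(m-1)}(p_u \cdot p_v \cdot p_w \cdot p_x)$, $1 \le i, j, k, l, u, v, w, x \le m-1$, with $i, j, k, l$ pairwise distinct and $u, v, w, x$ pairwise distinct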

Based on the above properties, we can obtain a partition of $U^{(m-1)}$. First, we partition
$U^{(m-1)}$ into five different sets subject to the formal constraints of (a)-(e). Second, we
further partition each set into subsets subject to different $r$ values (if $r$ is involved in
the formal definition of the set). After these two steps of decomposition, the final partition
includes 10 parts and can be represented as below:

$$\mathrm{Partition}\left[U^{(m-1)}\right] = \left\{\{p_i^2\}, \{p_i^3\}, \{p_i^4\}, \{p_i p_j\}, \{p_i p_j^2\}, \{p_i p_j^3\}, \{p_i^2 p_j^2\}, \{p_i p_j p_k\}, \{p_i p_j p_k^2\}, \{p_i p_j p_k p_l\}\right\}$$

It can be checked that the terms in each part give the same result with respect to the
integral operator $L^{(m-1)}$.

Therefore, if the general term formula of each part and the term numbers could be worked
out, the general term formula of $L^{(m-1)}\left\{\left[E_p^{(n)}(T)\right]^2\right\}$ follows directly.






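The following equations could be checked:

$$\begin{aligned}
L^{(m-1)}(p_i^2) &= \frac{2!}{(m+1)!}, & L^{(m-1)}(p_i \cdot p_j) &= \frac{1}{(m+1)!}, \\
L^{(m-1)}(p_i^3) &= \frac{3!}{(m+2)!}, & L^{(m-1)}(p_i \cdot p_j^2) &= \frac{2!}{(m+2)!}, \\
L^{(m-1)}(p_i^4) &= \frac{4!}{(m+3)!}, & L^{(m-1)}(p_i \cdot p_j^3) &= \frac{3!}{(m+3)!}, \\
L^{(m-1)}(p_i^2 \cdot p_j^2) &= \frac{2! \cdot 2!}{(m+3)!}, & L^{(m-1)}(p_i \cdot p_j \cdot p_k) &= \frac{1}{(m+2)!}, \\
L^{(m-1)}(p_i \cdot p_j \cdot p_k^2) &= \frac{2!}{(m+3)!}, & L^{(m-1)}(p_i \cdot p_j \cdot p_k \cdot p_l) &= \frac{1}{(m+3)!}
\end{aligned}$$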

In addition, the general term formula $N^{(m-1)}(\cdot)$ of the term number in each part is
given by

$$\begin{aligned}
N^{(m-1)}(p_i^2) &= m-1 \\
N^{(m-1)}(p_i^3) &= -2(m-1) \\
N^{(m-1)}(p_i^4) &= m-1 \\
N^{(m-1)}(p_i p_j) &= (m-1)(m-2) \\
N^{(m-1)}(p_i p_j^2) &= -4(m-1)(m-2) \\
N^{(m-1)}(p_i p_j^3) &= 2(m-1)(m-2) \\
N^{(m-1)}(p_i^2 p_j^2) &= \tfrac{3}{2}(m-1)(m-2) \\
N^{(m-1)}(p_i p_j p_k) &= -(m-1)(m-2)(m-3) \\
N^{(m-1)}(p_i p_j p_k^2) &= 2(m-1)(m-2)(m-3) \\
N^{(m-1)}(p_i p_j p_k p_l) &= \frac{(m-1)(m-2)(m-3)(m-4)}{4}
\end{aligned}$$



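Replacing $L^{(m-1)}\left\{\left[E_p^{(n)}(T)\right]^2\right\}$ by the two sets of general term
formulae, we get

$$\begin{aligned}
\frac{1}{L^{(m-1)}(1)} \cdot L^{(m-1)}\left\{\left[E_p^{(n)}(T)\right]^2\right\}
&= \frac{4(n-1)^2}{n^2} \cdot \frac{1}{L^{(m-1)}(1)} \sum_{a \in \mathrm{Partition}[U^{(m-1)}]} N^{(m-1)}(a) \cdot L^{(m-1)}(a) \\
&= \frac{(n-1)^2 (m^3 + 2m^2 - 5m + 2)}{n^2 (m+1)(m+2)(m+3)}
\end{aligned} \tag{28}$$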
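The sum over the ten parts can be checked in exact arithmetic. The sketch below uses the term numbers $N^{(m-1)}(\cdot)$ and the integrals $L^{(m-1)}(\cdot)$ listed above, and assumes that the uniform normalizer equals $(m-1)!$, i.e. that $L^{(m-1)}(1) = 1/(m-1)!$ is the volume of the integral domain. The common factor $(n-1)^2/n^2$ is dropped, so only the $m$-dependent part of Formula (28) is compared. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial

def second_moment_from_terms(m):
    """4 * (m-1)! * sum_a N(a) * L(a): the m-dependent factor of Formula (28)."""
    f = factorial
    m1, m2, m3, m4 = m - 1, m - 2, m - 3, m - 4
    # (term number N, integral L) for each of the ten parts of the partition
    parts = [
        (m1,                            Fraction(f(2), f(m + 1))),   # p_i^2
        (-2 * m1,                       Fraction(f(3), f(m + 2))),   # p_i^3
        (m1,                            Fraction(f(4), f(m + 3))),   # p_i^4
        (m1 * m2,                       Fraction(1, f(m + 1))),      # p_i p_j
        (-4 * m1 * m2,                  Fraction(f(2), f(m + 2))),   # p_i p_j^2
        (2 * m1 * m2,                   Fraction(f(3), f(m + 3))),   # p_i p_j^3
        (Fraction(3, 2) * m1 * m2,      Fraction(4, f(m + 3))),      # p_i^2 p_j^2
        (-m1 * m2 * m3,                 Fraction(1, f(m + 2))),      # p_i p_j p_k
        (2 * m1 * m2 * m3,              Fraction(f(2), f(m + 3))),   # p_i p_j p_k^2
        (Fraction(m1 * m2 * m3 * m4, 4), Fraction(1, f(m + 3))),     # p_i p_j p_k p_l
    ]
    return 4 * f(m - 1) * sum(n_a * l_a for n_a, l_a in parts)

def second_moment_closed_form(m):
    """(m^3 + 2m^2 - 5m + 2) / ((m+1)(m+2)(m+3))."""
    return Fraction(m**3 + 2 * m**2 - 5 * m + 2, (m + 1) * (m + 2) * (m + 3))

for m in range(2, 12):
    print(m, second_moment_from_terms(m) == second_moment_closed_form(m))
```

Using `Fraction` avoids rounding, so the comparison is an exact identity check for each tested $m$ rather than a numerical approximation.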
Substituting Formula (28) into Formula (27), the following equation holds:

$$\mathrm{STD}_n^{(m)}(T) = \frac{n-1}{n} \cdot \sqrt{\frac{m^3 + 2m^2 - 5m + 2}{(m+1)(m+2)(m+3)} - \frac{(m-1)^2}{(m+1)^2}}$$

After the simplification, we obtain:

$$\mathrm{STD}_n^{(m)}(T) = \frac{2}{\sqrt{(m-1)\cdot(m+2)\cdot(m+3)}} \cdot \frac{(n-1)(m-1)}{n(m+1)}$$

Recall that $E_n^{(m)}(T) = \frac{(n-1)(m-1)}{n(m+1)}$, and hence Formula (26) is proved.

It is clear that, if $n \approx m$, $\mathrm{STD}_n^{(m)}(T) \approx \frac{2}{\sqrt{n}} \cdot \left| -\frac{m-1}{n\cdot(m+1)} \right|$.
Recall that $-\frac{m-1}{n\cdot(m+1)}$ is the uniform Bayesian-TEB. Hence, the estimation of
uniform Bayesian-TEB is relatively stable in the case of inadequate sampling.
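Formula (26) can also be checked by simulation. The sketch below assumes the Tsallis index q = 2, so that the expected entropy of a size-n sample drawn from p is (n-1)/n (1 - sum_i p_i^2), and draws p uniformly from the probability simplex by normalizing i.i.d. exponential variables. The function names, the values of n and m, and the number of draws are hypothetical choices.

```python
import math
import random

def std_closed_form(n, m):
    """Formula (26): 2 / sqrt((m-1)(m+2)(m+3)) * (n-1)(m-1) / (n(m+1))."""
    mean = (n - 1) * (m - 1) / (n * (m + 1))
    return 2.0 / math.sqrt((m - 1) * (m + 2) * (m + 3)) * mean

def std_monte_carlo(n, m, draws=100_000, seed=0):
    """Empirical standard deviation of (n-1)/n * T(p) for p uniform on the simplex."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        # Normalized i.i.d. exponentials are uniformly distributed on the simplex.
        e = [rng.expovariate(1.0) for _ in range(m)]
        total = sum(e)
        t = 1.0 - sum((x / total) ** 2 for x in e)   # Tsallis entropy, q = 2
        values.append((n - 1) / n * t)
    mean = sum(values) / draws
    return math.sqrt(sum((v - mean) ** 2 for v in values) / draws)

n, m = 10, 5
print(std_closed_form(n, m), std_monte_carlo(n, m))
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `draws` grows.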